(57) Abstract: The invention provides improved CDMA, WCDMA (UMTS) or other spread spectrum communication systems
of the type that process one or more spread-spectrum waveforms, each representative of a waveform received from a respective
user (or other transmitting device). The improvement is characterized by a first logic element that generates a residual composite
spread-spectrum waveform as a function of an arithmetic difference between a composite spread-spectrum waveform for all users
(or other transmitters) and an estimated spread-spectrum waveform for each user. It is further characterized by one or more second
logic elements that generate, for at least a selected user (or other transmitter), a refined spread-spectrum waveform as a function of
a sum of the residual composite spread-spectrum waveform and the estimated spread-spectrum waveform for that user.
Wireless Communications Methods and Systems for Long-Code and Other Spread
Spectrum Waveform Processing

Background of the Invention 


This application claims the benefit of priority of (i) US Provisional Application Serial
No. 60/275,846 filed March 14, 2001, entitled "Improved Wireless Communications Systems
and Methods"; (ii) US Provisional Application Serial No. 60/289,600 filed May 7, 2001,
entitled "Improved Wireless Communications Systems and Methods Using Long-Code
Multi-User Detection"; and (iii) US Provisional Application Serial No. 60/295,060 filed June 1,
2001, entitled "Improved Wireless Communications Systems and Methods for a
Communications Computer," the teachings of all of which are incorporated herein by reference.

The invention pertains to wireless communications and, more particularly, by way of
example, to methods and apparatus providing multiple user detection for use in code division
multiple access (CDMA) communications. The invention has application, by way of
non-limiting example, in improving the capacity of cellular phone base stations.

Code-division multiple access (CDMA) is used increasingly in wireless communications.
It is a form of multiplexing communications, e.g., between cellular phones and base
stations, based on distinct digital codes in the communication signals. This can be contrasted
with other wireless protocols, such as frequency-division multiple access and time-division
multiple access, in which multiplexing is based on the use of orthogonal frequency bands and
orthogonal time-slots, respectively.

A limiting factor in CDMA communication and, in particular, in so-called direct
sequence CDMA (DS-CDMA) communication, is the interference between multiple cellular
phone users in the same geographic area using their phones at the same time, which is referred
to as multiple access interference (MAI). Multiple access interference has an effect of limiting
the capacity of cellular phone base stations, driving service quality below acceptable levels
when there are too many users.

A technique known as multi-user detection (MUD) is intended to reduce multiple
access interference and, as a consequence, increase base station capacity. It can reduce
interference not only between multiple transmissions of like strength, but also that caused by users
so close to the base station as to otherwise overpower signals from other users (the so-called
near/far problem). MUD generally functions on the principle that signals from multiple
simultaneous users can be jointly used to improve detection of the signal from any single user. Many
forms of MUD are discussed in the literature; surveys are provided in Moshavi, "Multi-User
Detection for DS-CDMA Systems," IEEE Communications Magazine (October, 1996) and
Duel-Hallen et al., "Multiuser Detection for CDMA Systems," IEEE Personal Communications
(April 1995). Though a promising solution to increasing the capacity of cellular phone base
stations, MUD techniques are typically so computationally intensive as to limit practical
application.

An object of this invention is to provide improved methods and apparatus for wireless
communications. A related object is to provide such methods and apparatus for multi-user
detection or interference cancellation in code-division multiple access communications.

A further related object is to provide such methods and apparatus as provide improved 
short-code and/or long-code CDMA communications. 

A further object of the invention is to provide such methods and apparatus as can be
cost-effectively implemented and as require minimal changes in existing wireless
communications infrastructure.

A still further object of the invention is to provide methods and apparatus for executing
multi-user detection and related algorithms in real-time.

A still further object of the invention is to provide such methods and apparatus as 
manage faults for high-availability. 

Summary of the Invention



Wireless Communication Systems And Methods For Long-code Communications
For Regenerative Multiple User Detection Involving
Implicit Waveform Subtraction



The foregoing and other objects are among those attained by the invention which
provides, in one aspect, an improved spread-spectrum communication system of the type that
processes one or more spread-spectrum waveforms, e.g., CDMA transmissions, each
representative of a waveform received from, or otherwise associated with, a respective user (or other
transmitting device). The improvement is characterized by a first logic element, e.g., operating
in conjunction with a wireless base station receiver and/or modem, that generates a residual
composite spread-spectrum waveform as a function of a composite spread-spectrum waveform
and an estimated composite spread-spectrum waveform. It is further characterized by one or
more second logic elements that generate, for at least a selected user (or other transmitter), a
refined matched-filter detection statistic as a function of the residual composite
spread-spectrum waveform generated by the first logic element and a characteristic of an estimate of the
selected user's spread-spectrum waveform.



Related aspects of the invention provide a system as described above in which the first
logic element comprises arithmetic logic that generates the residual composite
spread-spectrum waveform based on a relation

$$r_{\mathrm{res}}^{(n)}[t] \equiv r[t] - \hat{r}^{(n)}[t]$$

wherein

$r_{\mathrm{res}}^{(n)}[t]$ is the residual composite spread-spectrum waveform,
$r[t]$ represents the composite spread-spectrum waveform,
$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform,
$t$ is a sample time period, and
$n$ is an iteration count.
The estimated composite spread-spectrum waveform, according to further related
aspects, can be pulse-shaped and based on estimated complex amplitudes, estimated symbols,
and codes encoded within the user waveforms.
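As a rough illustration of the first logic element, the following Python sketch forms a residual composite waveform as the sample-by-sample difference between a received composite waveform and an estimated composite waveform. The function name `residual_waveform` and the sample values are hypothetical and chosen only for illustration; the text does not prescribe an implementation.

```python
def residual_waveform(r, r_est):
    """Return r_res[t] = r[t] - r_est[t] for complex baseband samples."""
    if len(r) != len(r_est):
        raise ValueError("waveforms must cover the same sample times")
    return [x - x_est for x, x_est in zip(r, r_est)]


# Hypothetical received composite waveform and its current estimate (iteration n).
r = [1 + 1j, -2 + 0.5j, 0.5 - 1j]
r_est = [0.75 + 1j, -2 + 0j, 0.5 - 0.5j]
r_res = residual_waveform(r, r_est)
```

If the estimate were perfect, the residual would contain only noise and unmodeled interference, which is what makes it useful as a common input for refining every user's statistic.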



Still further aspects of the invention provide improved spread-spectrum communication
systems as described above in which the one or more second logic elements comprise rake
logic and summation logic, which generate the refined matched-filter detection statistic for at
least the selected user based on a relation

$$y_k^{(n+1)}[m] \equiv A_k^{(n)2} \cdot \hat{b}_k^{(n)}[m] + y_{\mathrm{res},k}^{(n)}[m]$$

wherein

$A_k^{(n)2}$ represents an amplitude statistic,
$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$th user for the $m$th symbol
period,
$y_{\mathrm{res},k}^{(n)}[m]$ represents a residual matched-filter detection statistic for the $k$th user,
and
$n$ is an iteration count.
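The summation step can be sketched in Python as follows, assuming the relation takes the additive form suggested by the definitions above: the amplitude statistic scales the soft symbol estimate, and the residual matched-filter statistic is added back. The function name and all numeric values are hypothetical.

```python
def refine_statistic(amp_sq, b_soft, y_res):
    """y_next[m] = amp_sq * b_soft[m] + y_res[m] for one user."""
    return [amp_sq * b + y for b, y in zip(b_soft, y_res)]


amp_sq = 0.5                   # hypothetical amplitude statistic for user k
b_soft = [1.0, -1.0, 0.5]      # soft symbol estimates, one per symbol period m
y_res = [0.25, 0.0, -0.125]    # residual matched-filter statistics for user k
y_next = refine_statistic(amp_sq, b_soft, y_res)
```

Adding the user's own estimated contribution back means that only the other users' estimated interference remains subtracted from the refined statistic.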



Further related aspects of the invention provide improved systems as described above
wherein the refined matched-filter detection statistic for each user is iteratively generated.
Related aspects of the invention provide such systems in which the user spread-spectrum
waveform for at least a selected user is generated by a receiver that operates on long-code
CDMA signals.



Further aspects of the invention provide a spread spectrum communication system, e.g.,
of the type described above, having a first logic element which generates an estimated
composite spread-spectrum waveform as a function of estimated user complex channel amplitudes,
time lags, and user codes. A second logic element generates a residual composite
spread-spectrum waveform as a function of a composite user spread-spectrum waveform and the estimated
composite spread-spectrum waveform. One or more third logic elements generate a refined
matched-filter detection statistic for at least a selected user as a function of the residual
composite spread-spectrum waveform and a characteristic of an estimate of the selected user's
spread-spectrum waveform.
A related aspect of the invention provides such systems in which the first logic element
generates the estimated re-spread waveform based on a relation

$$\rho^{(n)}[t] \equiv \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{r} \delta[t - \hat{\tau}_{kp}^{(n)} - rN_c] \cdot \hat{a}_{kp}^{(n)} \cdot c_k[r] \cdot \hat{b}_k^{(n)}[\lfloor r/N_k \rfloor]$$

wherein

$K_v$ is a number of simultaneous dedicated physical channels for all users,
$\delta[t]$ is a discrete-time delta function,
$\hat{a}_{kp}^{(n)}$ is an estimated complex channel amplitude for the $p$th multipath component
for the $k$th user,
$c_k[r]$ represents a user code comprising at least a scrambling code, an orthogonal
variable spreading factor code, and a $j$ factor associated with even
numbered dedicated physical channels,
$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$th user for the $m$th symbol
period,
$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$th multipath component for the $k$th user,
$N_k$ is a spreading factor for the $k$th user,
$t$ is a sample time index,
$L$ is a number of multi-path components,
$N_c$ is a number of samples per chip, and
$n$ is an iteration count.
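One way to read the definitions above is that each chip of each user's code, weighted by the estimated amplitude and the soft symbol it belongs to, is placed at its lagged sample time. The Python sketch below follows that reading under simplifying assumptions that are not from the text: time lags are whole numbers of samples, and each user is described by a hypothetical dictionary layout.

```python
def respread(num_samples, users, samples_per_chip):
    """Accumulate amplitude * chip * symbol at each lagged chip time.

    Each user is a dict with hypothetical keys:
      'code'    : list of chips c_k[r]
      'symbols' : soft symbol estimates b_k[m]
      'sf'      : spreading factor N_k (chips per symbol)
      'paths'   : list of (amplitude, lag_in_samples), one per multipath component
    """
    rho = [0j] * num_samples
    for user in users:
        for amp, lag in user["paths"]:
            for r, chip in enumerate(user["code"]):
                t = lag + r * samples_per_chip      # sample time of chip r on this path
                if 0 <= t < num_samples:
                    symbol = user["symbols"][r // user["sf"]]
                    rho[t] += amp * chip * symbol
    return rho


user = {"code": [1, -1, 1, 1], "symbols": [1.0, -1.0], "sf": 2,
        "paths": [(1 + 0j, 0), (0.5j, 1)]}
rho = respread(10, [user], samples_per_chip=2)
```

The cost of this loop grows with the number of users times the number of paths times the number of chips, i.e. linearly in the number of users.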



Related aspects of the invention provide systems as described above wherein the first
logic element comprises arithmetic logic that generates the estimated composite
spread-spectrum waveform based on the relation

$$\hat{r}^{(n)}[t] \equiv \sum_{r} g[r] \, \rho^{(n)}[t - r]$$

wherein

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform,
$g[t]$ represents a raised-cosine pulse shape.
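Pulse shaping of this kind amounts to a discrete convolution of the re-spread waveform with the pulse taps. The sketch below is illustrative only: the roll-off factor, the tap count and the helper names are assumptions, and the raised-cosine formula used is the standard textbook form rather than one given in the text.

```python
import math


def raised_cosine(t, chip_len, beta):
    """Raised-cosine pulse at integer sample offset t (chip_len samples per chip)."""
    x = t / chip_len
    sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
    denom = 1.0 - (2.0 * beta * x) ** 2
    if abs(denom) < 1e-12:  # removable singularity at |x| = 1 / (2 * beta)
        arg = math.pi / (2.0 * beta)
        return (math.pi / 4.0) * math.sin(arg) / arg
    return sinc * math.cos(math.pi * beta * x) / denom


def pulse_shape(rho, taps, center):
    """r_est[t] = sum_r g[r] * rho[t - r], where taps[i] holds g[i - center]."""
    out = [0j] * len(rho)
    for t in range(len(rho)):
        for i, g in enumerate(taps):
            src = t - (i - center)
            if 0 <= src < len(rho):
                out[t] += g * rho[src]
    return out


half = 4
taps = [raised_cosine(i - half, chip_len=2, beta=0.5) for i in range(2 * half + 1)]
rho = [0j, 0j, 0j, 0j, 1 + 0j, 0j, 0j, 0j, 0j]   # single hypothetical impulse
r_est = pulse_shape(rho, taps, center=half)
```

With a single unit impulse as input, the output simply reproduces the pulse taps, which is a convenient sanity check for the convolution indexing.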

Related aspects of the invention provide such systems that comprise a CDMA base
station, e.g., of the type for use in relaying voice and data traffic from cellular phone and/or
modem users. Still further aspects of the invention provide improved spread spectrum
communication systems as described above in which the user waveforms are encoded using
long-code CDMA protocols.

Still other aspects of the invention provide methods for multiple user detection in a
spread-spectrum communication system paralleling the operations described above.

Wireless Communication Systems And Methods For Long-code Communications
For Regenerative Multiple User Detection Involving
Matched-filter Outputs

Further aspects of the invention provide an improved spread spectrum communication
system, e.g., of the type described above, having a first logic element, operating in conjunction
with a wireless base station receiver and/or modem, that generates an estimated composite
spread-spectrum waveform as a function of user waveform characteristics, e.g., estimated
complex amplitudes, time lags, symbols and code. The invention is further characterized by
one or more second logic elements that generate for at least a selected user a refined
matched-filter detection statistic as a function of a difference between a first matched-filter detection
statistic for that user and an estimated matched-filter detection statistic, the latter of which is
a function of the estimated composite spread-spectrum waveform generated by the first logic
element.

Related aspects of the invention as described above provide for improved wireless
communications wherein each of the second logic elements generates the refined matched-filter
detection statistic for the selected user as a function of a difference between (i) a sum of the first
matched-filter detection statistic for that user and a characteristic of an estimate of that user's
spread-spectrum waveform, and (ii) the estimated matched-filter detection statistic for that user
based on the estimated composite spread-spectrum waveform.

Further related aspects of the invention provide systems as described above in which
the second logic elements comprise rake logic and summation logic which generate refined
matched-filter detection statistics for at least a selected user in accord with the relation

$$y_k^{(n+1)}[m] \equiv A_k^{(n)2} \cdot \hat{b}_k^{(n)}[m] + y_k^{(n)}[m] - y_{\mathrm{est},k}^{(n)}[m]$$

wherein

$A_k^{(n)2}$ represents an amplitude statistic,
$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$th user for the $m$th symbol
period,
$y_k^{(n)}[m]$ represents the first matched-filter detection statistic,
$y_{\mathrm{est},k}^{(n)}[m]$ represents the estimated matched-filter detection statistic, and
$n$ is an iteration count.
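The matched-filter-output form of the update can be sketched as below, assuming (as the preceding paragraphs describe) that the refined statistic is the sum of the first statistic and the scaled soft symbol estimate, minus the estimated statistic. The function name and numbers are hypothetical.

```python
def refine_from_mf_outputs(amp_sq, b_soft, y_first, y_est):
    """y_next[m] = amp_sq * b_soft[m] + y_first[m] - y_est[m] for one user."""
    return [amp_sq * b + y - ye for b, y, ye in zip(b_soft, y_first, y_est)]


amp_sq = 0.5                 # hypothetical amplitude statistic for user k
b_soft = [1.0, -1.0]         # soft symbol estimates
y_first = [0.75, -0.25]      # first matched-filter detection statistics
y_est = [0.5, -0.5]          # statistics obtained by raking the estimated waveform
y_next = refine_from_mf_outputs(amp_sq, b_soft, y_first, y_est)
```

Because the subtraction happens on symbol-rate statistics rather than on the sampled waveform, the received antenna stream does not need to be revisited at each iteration in this variant.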



Other related aspects of the invention include generating the refined matched-filter
detection statistic for the selected user and iteratively refining that detection statistic zero or
more times.

Related aspects of the invention as described above provide for improved wireless
communications methods wherein an estimated matched-filter detection statistic is based on
the estimated composite spread-spectrum waveform according to the relation

$$y_{\mathrm{est},k}^{(n)}[m] \equiv \mathrm{Re}\left\{ \sum_{p=1}^{L} \hat{a}_{kp}^{(n)*} \cdot \frac{1}{2N_k} \sum_{r=0}^{N_k-1} \hat{r}^{(n)}[rN_c + \hat{\tau}_{kp}^{(n)} + mT_k] \cdot c_{km}^{*}[r] \right\}$$

wherein

$L$ is a number of multi-path components,
$\hat{a}_{kp}^{(n)}$ is an estimated complex channel amplitude for the $p$th multipath component
for the $k$th user,
$N_k$ is a spreading factor for the $k$th user,
$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform,
$N_c$ is a number of samples per chip,
$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$th multipath component for the $k$th user,
$m$ is a symbol period,
$T_k$ is a data bit duration,
$n$ is an iteration count, and
$c_{km}[r]$ represents a user code comprising at least a scrambling code, an orthogonal
variable spreading factor code, and a $j$ factor associated with even numbered
dedicated physical channels.
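A rake-style matched filter over an estimated waveform can be sketched as follows. This is a minimal sketch under stated assumptions: lags are integer sample counts, the symbol duration in samples equals the spreading factor times the samples per chip, and the `1 / (2 * sf)` scaling is one common normalisation that is assumed here rather than taken from the text. All names are hypothetical.

```python
def rake_statistic(waveform, code, paths, m, sf, samples_per_chip):
    """Despread symbol m on each finger, weight by the conjugate amplitude,
    sum over fingers and keep the real part."""
    symbol_len = sf * samples_per_chip            # assumed symbol duration in samples
    total = 0j
    for amp, lag in paths:
        acc = 0j
        for r in range(sf):
            t = r * samples_per_chip + lag + m * symbol_len
            acc += waveform[t] * code[m * sf + r].conjugate()
        total += amp.conjugate() * acc / (2 * sf)  # assumed normalisation
    return total.real


code = [1, -1, 1, 1]                  # hypothetical chips covering two symbols
paths = [(1 + 0j, 0)]                 # one finger: unit amplitude, zero lag
waveform = [1 + 0j, -1 + 0j, -1 + 0j, -1 + 0j]   # symbols [+1, -1] spread by code
y0 = rake_statistic(waveform, code, paths, m=0, sf=2, samples_per_chip=1)
y1 = rake_statistic(waveform, code, paths, m=1, sf=2, samples_per_chip=1)
```

In this toy case the sign of each statistic matches the transmitted symbol, which is the property the downstream symbol estimate relies on.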






Wireless Communication Systems And Methods For Long-code Communications
For Regenerative Multiple User Detection Involving
Pre-maximal Combination Matched Filter Outputs

Still further aspects of the invention provide improved spread-spectrum communication
systems, e.g., of the type described above, having one or more first logic elements, e.g.,
operating in conjunction with a wireless base station receiver and/or modem, that generate a
first complex channel amplitude estimate corresponding to at least a selected user and a
selected finger of a rake receiver that receives the selected user waveforms. One or more
second logic elements generate an estimated composite spread-spectrum waveform that is a
function of one or more complex channel amplitudes, estimated delay lags, estimated symbols,
and/or codes of the one or more user spread-spectrum waveforms. One or more third logic
elements generate a second pre-combination matched-filter detection statistic for at least a
selected user and for at least a selected finger as a function of a first pre-combination
matched-filter detection statistic for that user and a pre-combination estimated matched-filter detection
statistic for that user.

Related aspects of the invention provide systems as described above in which one or
more fourth logic elements generate a second complex channel amplitude estimate
corresponding to at least a selected user and at least a selected finger.
Still further aspects of the invention provide systems as described above in which the
third logic elements generate the second pre-combination matched-filter detection statistic for
at least the selected user and at least the selected finger as a function of a difference between (i)
the sum of the first pre-combination matched-filter detection statistic for that user and that
finger and a characteristic of an estimate of the selected user's spread-spectrum waveform and
(ii) the pre-combination estimated matched-filter detection statistic for that user and that
finger.

Related aspects of the invention as described above provide for the first logic elements
generating a complex channel amplitude estimate corresponding to at least a selected user and
at least a selected finger of a rake receiver that receives the selected user waveforms based on
a relation

wherein

$\hat{a}_{kp}^{(n)}$ is a complex channel amplitude estimate corresponding to the $p$th finger of
the $k$th user,
$w[s]$ is a filter,
$N_p$ is a number of symbols,
$y_{kp}^{(n)}[m]$ is a first pre-combination matched-filter detection statistic corresponding
to the $p$th finger of the $k$th user for the $m$th symbol period,
$M$ is a number of symbols per slot,
$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$th user for the $m$th symbol
period,
$m$ is a symbol period index,
$s$ is a slot index, and
$n$ is an iteration count.
Further related aspects of the invention as described above provide for one or more
second logic elements, each coupled with a first logic element and using the complex channel
amplitudes generated therefrom to generate an estimated composite re-spread waveform based
on the relation

$$\rho^{(n)}[t] \equiv \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{r} \delta[t - \hat{\tau}_{kp}^{(n)} - rN_c] \cdot \hat{a}_{kp}^{(n)} \cdot c_k[r] \cdot \hat{b}_k^{(n)}[\lfloor r/N_k \rfloor]$$

wherein

$K_v$ is a number of simultaneous dedicated physical channels for all users,
$\delta[t]$ is a discrete-time delta function,
$\hat{a}_{kp}^{(n)}$ is an estimated complex channel amplitude for the $p$th multipath component
for the $k$th user,
$c_k[r]$ represents a user code comprising at least a scrambling code, an orthogonal
variable spreading factor code, and a $j$ factor associated with even
numbered dedicated physical channels,
$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$th user for the $m$th symbol
period,
$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$th multipath component for the $k$th user,
$N_k$ is a spreading factor for the $k$th user,
$t$ is a sample time index,
$L$ is a number of multi-path components,
$N_c$ is a number of samples per chip, and
$n$ is an iteration count.



Further related aspects of the invention provide systems as described above in which
the second logic element comprises arithmetic logic that generates the estimated composite
spread-spectrum waveform based on a relation

$$\hat{r}^{(n)}[t] \equiv \sum_{r} g[r] \, \rho^{(n)}[t - r]$$

wherein

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform,
$g[t]$ represents a pulse shape.



Still further related aspects of the invention provide systems as described above in
which the third logic elements comprise arithmetic logic that generates the second
pre-combination matched-filter detection statistic based on the relation

$$y_{kp}^{(n+1)}[m] \equiv \hat{a}_{kp}^{(n)} \cdot \hat{b}_k^{(n)}[m] + y_{kp}^{(n)}[m] - y_{\mathrm{est},kp}^{(n)}[m]$$

wherein

$y_{kp}^{(n+1)}[m]$ represents the pre-combination matched-filter detection statistic for
the $p$th finger for the $k$th user for the $m$th symbol period,
$\hat{a}_{kp}^{(n)}$ is the complex channel amplitude for the $p$th finger for the $k$th user,
$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$th user for the $m$th symbol
period,
$y_{kp}^{(n)}[m]$ represents the first pre-combination matched-filter detection statistic
for the $p$th finger for the $k$th user for the $m$th symbol period,
$y_{\mathrm{est},kp}^{(n)}[m]$ represents the pre-combination estimated matched-filter detection
statistic for the $p$th finger for the $k$th user for the $m$th symbol period,
and
$n$ is an iteration count.
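A per-finger version of the update can be sketched in the same way, assuming the additive form described earlier for the third logic elements: the finger's complex amplitude times the soft symbol estimate, plus the first pre-combination statistic, minus the estimated one. Values here are complex because the fingers have not yet been combined; the names and numbers are hypothetical.

```python
def refine_precombination(amp, b_soft, y_first, y_est):
    """Per-finger update: amp * b_soft[m] + y_first[m] - y_est[m]."""
    return [amp * b + y - ye for b, y, ye in zip(b_soft, y_first, y_est)]


amp = 0.5 + 0.5j                         # hypothetical amplitude for finger p of user k
b_soft = [1.0, -1.0]                     # soft symbol estimates for user k
y_first = [1 + 1j, -0.5j]                # first pre-combination statistics
y_est = [0.5 + 0.5j, -0.5 - 0.5j]        # pre-combination estimated statistics
y_next = refine_precombination(amp, b_soft, y_first, y_est)
```

Keeping the statistics per finger leaves room to re-estimate the channel amplitudes from the refined values before the fingers are finally combined.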



Still further aspects of the invention provide methods of operating multiuser detector
logic, wireless base stations and/or other wireless receiving devices or systems operating in the
manner of the apparatus above. Further aspects of the invention provide such systems in which
the first and second logic elements are implemented on any of processors, field programmable
gate arrays, array processors and co-processors, or any combination thereof. Other aspects of
the invention provide for iteratively refining the pre-combination matched-filter detection
statistics zero or more times.

Other aspects of the invention provide methods for an improved spread-spectrum
communication system of the type described above.
Brief Description of the Illustrated Embodiment

A more complete understanding of the invention may be attained by reference to the 
drawings, in which: 

Figure 1 is a block diagram of components of a wireless base-station utilizing a
multi-user detection apparatus according to the invention.

Figure 2 is a detailed diagram of a modem of the type that receives spread-spectrum
waveforms and generates a baseband spectrum waveform together with amplitude and time lag
estimates as used by the invention.

Figures 3 and 4 depict methods according to the invention for multiple user detection 
using explicitly regenerated user waveforms which are added to a residual waveform. 

Figure 5 depicts methods according to the invention for multiple user detection in
which user waveforms are regenerated from a composite spread-spectrum pulse-shaped
waveform.

Figure 6 depicts methods according to the invention for multiple user detection using
matched-filter outputs where a composite spread-spectrum pulse-shaped waveform is
rake-processed.

Figure 7 depicts methods according to the invention for multiple user detection using
pre-maximum ratio combined matched-filter output, where a composite spread-spectrum
pulse-shaped waveform is rake-processed.

Figure 8 depicts an approach for processing user waveforms using full or partial decod- 
ing at various time-transmission intervals based on user class. 

Figure 9 depicts an approach for combining multi-path data across received frame
boundaries to preserve the number of multi-user detection processing frame counts.

Figure 10 illustrates the mapping of rake receiver output to virtual users to preserve spreading
factor and number of data channels across multiple user detection processing frames where the
data is linear and contiguous in memory.
Figure 11 depicts a long-code loading implementation utilizing pipelined processing
and a triple-iteration of refinement in a system according to the invention; and

Figure 12 illustrates skewing of multiple user waveforms. 

Detailed Description of the Illustrated Embodiment

Code-division multiple access (CDMA) waveforms or signals transmitted, e.g., from a
user cellular phone, modem or other CDMA signal source, can become distorted by, and
undergo amplitude fades and phase shifts due to, phenomena such as scattering, diffraction
and/or reflection off buildings and other natural and man-made structures. This includes
CDMA, DS/CDMA, IS-95 CDMA, CDMAOne, CDMA2000 1X, CDMA2000 1xEV-DO,
WCDMA (or UMTS), and other forms of CDMA, which are collectively referred to hereinafter
as CDMA or WCDMA. Often the user or other source (collectively, "user") is also moving,
e.g., in a car or train, adding to the resulting signal distortion by alternately increasing and
decreasing the distances to and numbers of buildings, structures and other distorting factors
between the user and the base station.

In general, because each user signal can be distorted several different ways en route to
the base station or other receiver (hereinafter, collectively, "base station"), the signal may be
received in several components, each with a different time lag or phase shift. To maximize
detection of a given user signal across multiple time lags, a rake receiver is utilized. Such a
receiver is coupled to one or more RF antennas (which serve as collection points for the
time-lagged components) and includes multiple fingers, each designed to detect a different
multipath component of the user signal. By combining the components, e.g., in power or
amplitude, the receiver permits the original waveform to be discerned more readily, e.g., by
downstream elements in the base station and/or communications path.
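The combining step can be illustrated with a short Python sketch. It assumes one particular combining rule, weighting each finger output by the conjugate of its estimated channel amplitude (often called maximal ratio combining), which is only one of the options the text allows; the function name and values are hypothetical.

```python
def combine_fingers(finger_outputs, amplitudes):
    """Weight each finger output by the conjugate of its channel amplitude,
    sum across fingers, and keep the real part."""
    return sum(a.conjugate() * y for a, y in zip(amplitudes, finger_outputs)).real


# Hypothetical two-path channel carrying the symbol +1 with no noise.
amplitudes = [1 + 0j, 0.5j]
finger_outputs = [a * 1.0 for a in amplitudes]
combined = combine_fingers(finger_outputs, amplitudes)
```

The conjugate weighting rotates each path back to a common phase, so the two components add constructively instead of partially cancelling.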

A base station must typically handle multiple user signals, and detect and differentiate
among signals received from multiple simultaneous users, e.g., multiple cell phone users in the
vicinity of the base station. Detection is typically accomplished through use of multiple rake
receivers, one dedicated to each user. This strategy is referred to as single user detection
(SUD). Alternatively, one larger receiver can be assigned to demodulate the totality of users
jointly. This strategy is referred to as multiple user detection (MUD). Multiple user detection
can be accomplished through various techniques which aim to discern the individual user
signals and to reduce signal outage probability or bit-error rates (BER) to acceptable levels.

However, the process has heretofore been limited due to computational complexities
which can increase exponentially with respect to the number of simultaneous users. Described
below are embodiments that overcome this, providing, for example, methods for multiple user
detection wherein the computational complexity is linear with respect to the number of users
and providing, by way of further example, apparatus for implementing those and other
methods that improve the throughput of CDMA and other spread-spectrum receivers. The
illustrated embodiments are implemented in connection with long-code CDMA transmitting and
receiving apparatus; however, those skilled in the art will appreciate that the methods and
apparatus herein may be used in connection with short-code and other CDMA signalling protocols
and receiving apparatus, as well as with other spread spectrum signalling protocols and
receiving apparatus. In these regards and as used herein, the terms long-code and short-code are used
in their conventional sense: the former referring to codes that exceed one symbol period; the
latter, to codes that are a single symbol period or less.

Five embodiments of long-code regeneration and waveform refinement are presented
herein. The first two may be referred to as a base-line embodiment and a residual signal
embodiment. The remaining three embodiments use implicit waveform subtraction,
matched-filter outputs rather than antenna streams, and pre-maximum ratio combination of
matched-filter outputs. It will be appreciated by those skilled in the art that other modifications to these
techniques can be implemented that produce like results based on modifications of the
methods described herein.

Figure 1 depicts components of a wireless base station 100 of the type in which the 
invention is practiced. The base station 100 includes an antenna array 114, radio frequency/ 
intermediate frequency (RF/IF) analog-to-digital converter (ADC), multi-antenna receivers 
110, rake modems 112, MUD processing logic 118 and symbol rate processing logic 120, 
coupled as shown. 

Antenna array 114 and receivers 110 are conventional such devices of the type used in
wireless base stations to receive wideband CDMA (hereinafter "WCDMA") transmissions
from multiple simultaneous users (here, identified by numbers 1 through K). Each RF/IF
receiver (e.g., 110) is coupled to antenna or antennas 114 in the conventional manner known in
the art, with one RF/IF receiver 110 allocated for each antenna 114. Moreover, the antennas
are arranged per convention to receive components of the respective user waveforms along
different lagged signal paths discussed above. Though only three antennas 114 and three
receivers 110 are shown, the methods and systems taught herein may be used with any number
of such devices, regardless of whether configured as a base station, a mobile unit or otherwise.
Moreover, as noted above, they may be applied in processing other CDMA and wireless
communications signals.

Each RF/IF receiver 110 routes digital data to each modem 112. Because there are 
multiple antennas, here, Q of them, there are typically Q separate channel signals communi- 
cated to each modem card 112. 
Generally, each user generating a WCDMA signal (or other subject wireless
communication signal) received and processed by the base station is assigned a unique long-code
sequence for purposes of differentiating between the multiple user waveforms received at the
base station, and each user is assigned a unique rake modem 112 for purposes of demodulating
the user's received signal. Each modem 112 may be independent, or may share resources from
a pool. The rake modems 112 process the received signal components along fingers, with each
receiver discerning the signals associated with that receiver's respective user codes. The
received signal components are denoted here as $r_{kq}[t]$, denoting the channel signal (or
waveform) from the $k$th user from the $q$th antenna, or $r_k[t]$, denoting all channel signals (or
waveforms) originating from the $k$th user, in which case $r_k[t]$ is understood to be a column vector
with one element for each of the $Q$ antennas. The modems 112 process the received signals
$r_k[t]$ to generate detection statistics $y_k^{(0)}[m]$ for the $k$th user for the $m$th symbol period. To this
end, the modems 112 can, for example, combine the components $r_{kq}[t]$ by power, amplitude or
otherwise, in the conventional manner, to generate the respective detection statistics $y_k^{(0)}[m]$.
In the course of such processing, each modem 112 determines the amplitude (denoted herein as
$a$) of and time lag (denoted herein as $\tau$) between the multiple components of the respective
user channel. The modems 112 can be constructed and operated in the conventional manner
known in the art, optionally, as modified in accord with the teachings of some of the
embodiments below.

The modems 112 route their respective user detection statistics $y_k^{(0)}[m]$, as well as the
amplitudes and time lags, to multiple user detection (MUD) logic 118 constructed and
operated as described in the sections that follow. The MUD logic 118 processes the received
signals from each modem 112 to generate a refined output, $y_k^{(1)}[m]$, or more generally, $y_k^{(n)}[m]$,
where $n$ is an index reflecting the number of times the detection statistics are iteratively or
regeneratively processed by the logic 118. Thus, whereas the detection statistic produced by
the modems is denoted as $y_k^{(0)}[m]$, indicating that there has been no refinement, those generated
by processing the $y_k^{(0)}[m]$ detection statistics with logic 118 are denoted $y_k^{(1)}[m]$, those
generated by processing the $y_k^{(1)}[m]$ detection statistics with logic 118 are denoted $y_k^{(2)}[m]$, and so
forth. Further waveforms used and generated by logic 118 are similarly denoted, e.g., $r^{(n)}[t]$.

Though discussed below are embodiments in which the logic 118 is utilized only once,
i.e., to generate $y_k^{(1)}[m]$ from $y_k^{(0)}[m]$, other embodiments may employ that logic 118 multiple
times to generate still more refined detection statistics, e.g., for wireless communications
applications requiring lower bit error rates (BER). For example, in some implementations, a single
logic stage 118 is used for voice applications, whereas two or more logic stages are used for
data applications. Where multiple stages are employed, each may be carried out using the
same hardware device (e.g., processor, co-processor or field programmable gate array) or with
a successive series of such devices.

The refined user detection statistics, e.g., $y_k^{(1)}[m]$ or more generally $y_k^{(n)}[m]$,
are communicated by the MUD process 118 to a symbol process 120. This determines the digital
information contained within the detection statistics, and processes (or otherwise directs) that
information according to the user class to which the user belongs, e.g., voice or data user,
all in the conventional manner.

Though the discussion herein focuses on use of MUD logic 118 in a wireless base station,
those skilled in the art will appreciate that the teachings hereof are equally applicable to
MUD detection in any other CDMA signal processing environment such as, by way of non-limiting
example, cellular phones and modems. For convenience, such cellular base stations and
other environments are referred to herein as "base stations."

Referring to Figure 2, modem 112 receives the channel signals r[t] 122 from the RF/IF
receiver (Figure 1). The signals are first input into a searcher receiver 212. The searcher
receiver analyzes the digital waveform input, and estimates a time offset $\hat{\tau}_{kp}$ for
each signal component (e.g., for each finger). As those skilled in the art will appreciate, the
"hat" or ^ symbol denotes estimated values. The time offset for each antenna channel is
communicated to a corresponding rake receiver 214.

The rake receivers 214 receive both the digital signals r[t] from the RF/IF receivers, and the
time offsets, $\hat{\tau}_{kp}$. The receivers 214 calculate the pre-combination matched-filter
detection statistics, $y_{kp}^{(0)}[m]$, and estimate the signal amplitude, $\hat{a}_{kp}$, for
each of the signals. The amplitudes are complex in value, and hence include both the magnitude
and phase information. The pre-combination matched-filter detection statistics,
$y_{kp}^{(0)}[m]$, and the amplitudes $\hat{a}_{kp}$ for each finger receiver 212, are routed to
a maximal ratio combining (MRC) 216 process and combined to form a first approximation of the
symbols transmitted by each user, denoted $y_k^{(0)}[m]$. While the MRC 216 process is utilized
in the illustrated embodiment, other methods for combining the multiple signals are known in the
art, e.g., optimal combining, equal gain combining and selection combining, among others, and
can be used to achieve the same
results.
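
As a minimal sketch of maximal ratio combining for a single symbol period, the following Python
fragment weights each finger output by the conjugate of its estimated amplitude and keeps the
real part of the sum. The function name, the scalar (single-antenna) amplitudes and the
noise-free test values are assumptions made for illustration only; they are not taken from the
illustrated embodiment.

```python
def maximal_ratio_combine(finger_outputs, amplitudes):
    """Combine per-finger statistics for one symbol: Re{ sum_q conj(a_q) * y_q }."""
    return sum((a.conjugate() * y).real for a, y in zip(amplitudes, finger_outputs))

# Two fingers carrying the same symbol b = +1 through different complex gains.
amplitudes = [0.8 + 0.6j, 0.3 - 0.4j]               # assumed amplitude estimates
symbol = 1.0
finger_outputs = [a * symbol for a in amplitudes]   # noise-free for illustration
combined = maximal_ratio_combine(finger_outputs, amplitudes)
```

With perfect amplitude estimates the combined statistic equals the sum of the squared finger
magnitudes times the symbol, so stronger fingers contribute more to the decision.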

At this point, it can be appreciated by one skilled in the art that each detection statistic,
$y_k^{(0)}[m]$, contains not only the signal originating from user k, but also has components
(e.g., interference and noise) that have originated in the channel (e.g., the environment in
which the signal was propagated and/or in the receiving apparatus itself). Hence, it is further
necessary to differentiate each user's signal from all others. This function is provided by the
multiple user detection (MUD) card 118.

The methods and apparatus described below provide for processing long-code WCDMA
at sample rates and can be introduced into a conventional base station as an enhancement to the
matched-filter rake receiver. The algorithms and processes can be implemented in hardware,
software, or any combination of the two including firmware, field programmable gate arrays
(FPGAs), co-processors, and/or array processors.

The following discussion illustrates the calculations involved in the illustrated multiple
user detection process. For the following discussion, and as can be recognized by one skilled
in the art, the term physical user refers to an actual user. Each physical user is regarded as a
composition of virtual users. The concept of virtual users is used to account for both the
dedicated physical data channels (DPDCH) and the dedicated physical control channel (DPCCH).
There are $1 + N_{dk}$ virtual users corresponding to the $k$-th physical user, where $N_{dk}$
is the number of DPDCHs for the $k$-th user.



As one with ordinary skill in the art can appreciate, when long-codes are used, the baseband
received signal, r[t], which is a column vector with one element per antenna, can be
modeled as:

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_{km}[t - mT_k]\, b_k[m] + w[t] \qquad (1)$$

where $t$ is the integer time sample index, $K_v$ is the number of virtual users,
$T_k = N_k N_c$ is the channel symbol duration, which depends on the user spreading factor,
$N_k$ is the spreading factor for the $k$-th virtual user, $N_c$ is the number of samples per
chip, $w[t]$ is receiver noise and other-cell interference, $\tilde{s}_{km}[t]$ is the
channel-corrupted signature waveform for the $k$-th virtual user over the $m$-th symbol period,
and $b_k[m]$ is the channel symbol for the $k$-th virtual user over the $m$-th symbol period.

Since long-codes extend over many symbol periods, the user signature waveform and
hence the channel-corrupted signature waveform vary from symbol period to symbol period.
For L multi-path components, the channel-corrupted signature waveform for the $k$-th virtual
user is modeled as,

$$\tilde{s}_{km}[t] = \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp}] \qquad (2)$$

where $a_{kp}$ are the complex multi-path amplitudes and $\tau_{kp}$ are the corresponding time
lags. The amplitude ratios $\beta_k$ are incorporated into the amplitudes $a_{kp}$. One skilled
in the art will see that if k and l are virtual users corresponding to the DPCCH and the DPDCHs
of the same physical user, then, aside from scaling by $\beta_k$ and $\beta_l$, the amplitudes
$a_{kp}$ and $a_{lp}$ are equal. This is due to the fact that the signal waveforms for both the
DPCCH and the DPDCH pass through the same channel.

The waveform $s_{km}[t]$ is referred to as the signature waveform for the $k$-th virtual user
over the $m$-th symbol period. This waveform is generated by passing the code sequence
$c_{km}[n]$ through a pulse-shaping filter $g[t]$,

$$s_{km}[t] = \sum_{r=0}^{N_k-1} g[t - rN_c]\, c_{km}[r] \qquad (3)$$

where $g[t]$ is the raised-cosine pulse shape. Since $g[t]$ is a raised-cosine pulse as
opposed to a root-raised-cosine pulse, the received signal r[t] represents the baseband signal
after filtering by the matched chip filter. The code sequence $c_{km}[r] = c_k[r + mN_k]$
represents the combined scrambling code, orthogonal variable spreading factor (OVSF) code and
$j$ factor associated with even numbered DPDCHs.
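
The construction in equation (3), in which each chip is placed every $N_c$ samples and weighted
by the pulse shape, can be sketched as below. This is only an illustration: the three-tap pulse
is a placeholder rather than a true raised cosine, the pulse is indexed causally from sample 0,
and the function and variable names are assumptions.

```python
def signature_waveform(chips, pulse, samples_per_chip):
    """Return s[t] = sum_r g[t - r*Nc] * c[r] for one symbol period.

    `pulse` holds g on sample indices 0..len(pulse)-1, i.e. a causal,
    truncated stand-in for the raised-cosine shape (an assumption).
    """
    length = (len(chips) - 1) * samples_per_chip + len(pulse)
    waveform = [0j] * length
    for r, chip in enumerate(chips):
        for i, g in enumerate(pulse):
            waveform[r * samples_per_chip + i] += g * chip
    return waveform

# Two complex chips (scrambling x OVSF), Nc = 2 samples per chip.
chips = [1 + 1j, 1 - 1j]
pulse = [0.5, 1.0, 0.5]          # placeholder taps, not a real raised cosine
s = signature_waveform(chips, pulse, samples_per_chip=2)
```

Because neighbouring pulses overlap, the sample between the two chip centres carries a
contribution from both chips, which is the behaviour the pulse-shaping filter introduces.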

The received signal r[t], which has been match-filtered to the chip pulse, is next
match-filtered by the user long-code sequence filter and combined over multiple fingers. The
resulting detection statistic is denoted here as $y_l[m]$, the matched-filter output for the
$l$-th virtual user over the $m$-th symbol period. The matched-filter output $y_l[m]$ for the
$l$-th virtual user can be written,

$$y_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r[rN_c + \hat{\tau}_{lq} + mT_l]\, c_{lm}^{*}[r] \right\} \qquad (4)$$

where $\hat{a}_{lq}$ is the estimate of $a_{lq}$ and $\hat{\tau}_{lq}$ is the estimate of
$\tau_{lq}$.

Because of the extreme computational complexity of symbol-rate multiple user detec- 
tion for long-codes, it is advantageous to resort to regenerative multiple user detection when 
long-codes are used. Although regenerative multiple user detection operates at the sample rate, 
for long-codes the overall complexity is lower than with symbol-rate multiple user detection. 
Symbol-rate multiple user detection requires calculating the correlation matrices every symbol 
period, which is unnecessary with the signal regeneration methods described herein. 

For regenerative multiple user detection, the signal waveforms of interferers are regenerated
at the sample rate and effectively subtracted from the received signal. A second pass
through the matched filter then yields improved performance. The computational complexity
of regenerative multiple user detection is linear with the number of users.

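By way of review, the regenerative multiple user detection can be implemented as a baseline
implementation. Referring back to the received signal, r[t]:

$$\begin{aligned}
r[t] &= \sum_{k=1}^{K_v} \sum_{m} \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp} - mT_k]\, b_k[m] + w[t] \\
     &= \sum_{k=1}^{K_v} r_k[t] + w[t] \\
r_k[t] &\equiv \sum_{m} \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp} - mT_k]\, b_k[m]
\end{aligned} \qquad (5)$$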

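For the baseline implementation, all estimated interference is subtracted, yielding a
cleaned-up signal $r_l^{(n+1)}[t]$ as follows:

$$\begin{aligned}
r_l^{(n+1)}[t] &\equiv r[t] - \sum_{k \neq l} \hat{r}_k^{(n)}[t] \\
\hat{r}_k^{(n)}[t] &\equiv \sum_{m} \sum_{p=1}^{L} \hat{a}_{kp}^{(n)}\, s_{km}[t - \hat{\tau}_{kp}^{(n)} - mT_k]\, \hat{b}_k^{(n)}[m]
\end{aligned} \qquad (6)$$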

The implementation represented by Equation (6) corresponds to a total subtraction of
the estimated interference. One skilled in the art will appreciate that performance can typically
be improved if only a fraction of the total estimated interference is subtracted (i.e., partial
interference subtraction), this owing to channel and symbol estimation errors. Equation (6) is
easily modified so as to incorporate partial interference cancellation by introducing a
multiplicative constant of magnitude less than unity to the sum total of the estimated
interference. When multiple cancellation stages are used, the optimum value of this constant is
different for each
stage.
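
A minimal sketch of the subtraction in equation (6), with an optional partial-cancellation
factor, is given below. The function name, the `fraction` parameter and the toy real-valued
waveforms are assumptions for illustration; the source does not specify a value for the
constant.

```python
def cleaned_signal(received, estimates, user, fraction=1.0):
    """r_l^(n+1)[t] = r[t] - fraction * sum_{k != l} r_hat_k[t].

    fraction = 1.0 is total subtraction; 0 < fraction < 1 is partial subtraction.
    """
    cleaned = []
    for t, sample in enumerate(received):
        interference = sum(est[t] for k, est in enumerate(estimates) if k != user)
        cleaned.append(sample - fraction * interference)
    return cleaned

# Three users, two samples; the received signal is the noise-free sum here.
estimates = [[1.0, -1.0], [0.5, 0.5], [-0.25, 0.25]]
received = [1.25, -0.25]
total = cleaned_signal(received, estimates, user=0)
partial = cleaned_signal(received, estimates, user=0, fraction=0.5)
```

With perfect estimates, total subtraction recovers the first user's waveform exactly; the
partial variant deliberately leaves some interference in place to limit the damage done when
the estimates are wrong.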

The above equations are implemented in the baseline long-code multiple user detection
process 118 as illustrated in Figure 3. The receiver base-band signal r[t] 122 is input to the
rake receiver cards 112 (i.e., one rake receiver for each user) as described above. Each of the
rake receivers 112 processes the base-band signal r[t] 122 and outputs the first approximation
of the transmitted symbol, $y_k^{(0)}[m]$ 304, for each user k (e.g., user 1 through user K), as
well as the estimated amplitude $\hat{a}_{kp}^{(0)}$, time lag $\hat{\tau}_{kp}^{(0)}$ and user
code 306. For ease of notation, here, the superscript refers to the $n$-th regeneration
iteration. Hence, for example, the superscript (0) refers to the baseband because no iterations
have been performed.

The $y_k^{(0)}[m]$ 304 output from the rake receiver 112 is input into a detector which
outputs hard or soft symbol estimates $\hat{b}_k^{(n)}[m]$ used to cancel the effects of multiple
access interference (MAI). One skilled in the art will appreciate that many different detectors
may be used, including the hard-limiting (sign function) detector, the null-zone detector, the
hyperbolic tangent detector and the linear-clipped detector, and that soft detectors (all but
the first listed
above) typically yield improved performance.
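
The four detectors named above can be sketched as simple scalar functions of a real-valued
detection statistic. The threshold, gain and limit parameters and their default values are
assumptions chosen for illustration; the source does not give numerical settings.

```python
import math

def hard_limit(y):
    """Sign-function detector: always commits to +1 or -1."""
    return 1.0 if y >= 0.0 else -1.0

def null_zone(y, threshold=0.5):
    """Outputs 0 (no cancellation) when |y| is below the assumed threshold."""
    return 0.0 if abs(y) < threshold else hard_limit(y)

def hyperbolic_tangent(y, gain=1.0):
    """Soft estimate tanh(gain * y); gain is an assumed tuning parameter."""
    return math.tanh(gain * y)

def linear_clipped(y, limit=1.0):
    """Linear in y for |y| < limit, clipped to +/-1 beyond it."""
    return max(-1.0, min(1.0, y / limit))

statistic = 0.2   # a weak, unreliable detection statistic
estimates = {
    "hard": hard_limit(statistic),
    "null_zone": null_zone(statistic),
    "tanh": hyperbolic_tangent(statistic),
    "linear_clipped": linear_clipped(statistic),
}
```

For a weak statistic the soft detectors return an estimate of small magnitude, so an unreliable
symbol decision regenerates only a small amount of (possibly wrong) interference.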

The outputs from the rake receivers 112 and the soft symbol estimates are input into a
respreading process 310 which assembles an estimated spread-spectrum waveform corresponding
to the selected user but without pulse shaping. The re-spread signals are input into
the raised-cosine filter 312 which produces an estimate of the received spread-spectrum
waveform for the selected user.

The raised-cosine pulse shaping process accepts the signals from each of the respread
processes (e.g., one for each user), and produces the estimated user waveforms
$\hat{r}_k^{(n)}[t]$. Next, the waveforms $\hat{r}_k^{(n)}[t]$ are further processed in a series
of summation processes 314, 316, 318 to determine each user's cleaned-up signal
$r_l^{(n+1)}[t]$ according to the above equation (6).

Therefore, for example, to determine the signal corresponding to the 1st user, the baseband
signal r[t] 122 from the RF/IF receivers 110 containing information from all simultaneous
users is reduced by the estimated signals $\hat{r}_k^{(n)}[t]$ for all users except the 1st
user. After the subtraction of the $\hat{r}_k^{(n)}[t]$ signals (e.g., $\hat{r}_2^{(n)}[t]$
through $\hat{r}_K^{(n)}[t]$ as illustrated), the remainder signal contains predominantly the
signal for the 1st user. Hence, the summation function 314 applies the above equation (6) to
produce the cleaned-up signal $r_1^{(n+1)}[t]$. This process is performed for each simultaneous
user.

The output from the summation processes 314, 316, 318 is supplied to the rake receivers
320 (or re-applied to the original rake receivers 112). The resulting signal produced by the
rake receivers 320 is the refined matched-filter detection statistic $y_k^{(1)}[m]$. The
superscript (1) indicates that this is the first iteration on the base-band signal. Hence, the
base-line long-code multiple user detection is implemented. As illustrated, only one iteration
is performed; however, in other embodiments, multiple iterations may be performed depending on
limitations (e.g., computational complexity, bandwidth, and other factors).
It can be appreciated by one skilled in the art that the above methods are limited by
bandwidth and computational complexity. Specifically, for example, if K = 128, i.e., there are
128 simultaneous users for this implementation, the total bisection bandwidth is 998.8 Gbytes/
second, determined with the following assumptions, for example:

3.84 Mchips / sec / antenna / stream
x 2 antennas
x 8 samples / chip
x 1 bytes / sample
x 128(128 - 1) streams
= 998.8 Gbytes / sec
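
The bandwidth product above can be checked with a few lines of arithmetic. The function and
parameter names below are assumptions for illustration; the numerical factors are those listed
in the text.

```python
CHIP_RATE = 3.84e6  # chips per second, per antenna, per stream

def bisection_bandwidth(streams, antennas=2, samples_per_chip=8, bytes_per_sample=1):
    """Bytes per second needed to move `streams` sample-rate waveforms."""
    return CHIP_RATE * antennas * samples_per_chip * bytes_per_sample * streams

K = 128
# Baseline: each of the K users receives the K - 1 other users' waveforms.
baseline_gbytes = bisection_bandwidth(K * (K - 1)) / 1e9
```

The K(K - 1) stream count is what makes the baseline bandwidth grow quadratically with the
number of users.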

The computational complexity is calculated in terms of billion operations per second
(GOPS), and is calculated separately for each of the processes of re-spreading, raised-cosine
filtering, interference cancellation (IC), and the finger receiver operations. The re-spread
process involves amplitude-chip-bit multiply-accumulate operations (macs). Assuming, for
example, that there are only four possible chips and further that the amplitude-chip
multiplications are performed via a table look-up requiring zero GOPS, then the re-spread
computational complexity is the (amplitude-chip)x(bit macs). Therefore, the re-spread
computational cost (in GOPS) is:

3.84 Mchips / sec / antenna / finger / virtual-user / multiple user detection stage
x 2 antennas
x 4 fingers
x 256 virtual users
x 1 multiple user detection stage
x 4 ops / chip (real x complex mac)
= 31.5 GOPS

Based on the same assumptions, the raised-cosine filter requires:

3.84 Mchips / sec / antenna / physical-user / multiple user detection stage
x 8 samples / chip
x 2 antennas
x 128 physical users
x 1 multiple user detection stage
x 6 ops / sample / tap (complex additions then real x complex mac)
x 24 taps (using symmetry)
= 1,132.5 GOPS

The computational cost of the IC process is

3.84 Mchips / sec / antenna / physical-user / multiple user detection stage
x 8 samples / chip
x 2 antennas
x 128 physical users
x 1 multiple user detection stage
x 2 ops / sample / physical user (complex add)
x 128 users
= 2,013.3 GOPS

Finally, the computational complexity for the rake receiver processes is:

3.84 Mchips / sec / antenna / physical-user / multiple user detection stage
x 2 antennas
x 4 fingers
x 256 virtual users
x 1 multiple user detection stage
x 8 ops / chip (complex mac)
= 62.9 GOPS

Summing the separate computational complexities for each of the above processes
yields the following results:

Process                     GOPS
Re-Spread                   31.5
Raised Cosine Filtering     1,132.5
IC                          2,013.3
Finger Receivers            62.9
TOTAL                       3,240.2
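
The baseline figures can be reproduced by multiplying out the factors listed for each process.
The sketch below does so in Python; the dictionary layout and names are assumptions, while the
factors are those given in the text. The total is formed from the rounded per-process values,
which is how the tabulated total of 3,240.2 arises.

```python
from math import prod

CHIP_RATE = 3.84e6  # chips per second

# Factors multiplying the chip rate for each baseline process.
baseline_factors = {
    "Re-Spread": [2, 4, 256, 1, 4],
    "Raised Cosine Filtering": [8, 2, 128, 1, 6, 24],
    "IC": [8, 2, 128, 1, 2, 128],
    "Finger Receivers": [2, 4, 256, 1, 8],
}

baseline_gops = {
    name: round(CHIP_RATE * prod(factors) / 1e9, 1)
    for name, factors in baseline_factors.items()
}
# Sum of the rounded entries, matching the way the table total is formed.
baseline_total = round(sum(baseline_gops.values()), 1)
```

The IC entry carries two factors of 128, one for the users being cleaned and one for the
waveforms subtracted from each, and so dominates the total.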



However, both the bandwidth and computational complexity are reduced by employing a
residual-signal implementation as now described. The bandwidth can be reduced by forming
the residual signal, which is the difference between the received signal and the total (i.e., all
users and all multi-paths) estimated signal. Then, the cleaned-up signal $r_l^{(n+1)}[t]$
expressed in terms of the residual signal is:

$$\begin{aligned}
r_l^{(n+1)}[t] &= r[t] - \sum_{k} \hat{r}_k^{(n)}[t] + \hat{r}_l^{(n)}[t] \\
               &= r_{\mathrm{res}}^{(n)}[t] + \hat{r}_l^{(n)}[t], \qquad
r_{\mathrm{res}}^{(n)}[t] \equiv r[t] - \sum_{k} \hat{r}_k^{(n)}[t]
\end{aligned} \qquad (7)$$



This implementation is illustrated in Figure 4. One skilled in the art can recognize that
through the point of determining the output from the raised-cosine filters, the residual signal
implementation is identical with that illustrated above within Figure 3. It is at this point that
the residual signal implementation varies as now described.

A summation process 402 calculates $r_{\mathrm{res}}^{(n)}[t]$ according to equation (7) above
by accepting the base-band signal r[t] and subtracting the signal $\hat{r}^{(n)}[t]$ (i.e., the
output from all of the raised-cosine filters 310).



Differing from the baseline implementation, here, a first summation process 402 is performed
by subtracting from the baseband signal r[t] 122 the output from each raised-cosine
pulse shaping process 310. This produces the residual signal $r_{\mathrm{res}}^{(n)}[t]$
corresponding to the difference between the baseband signal and the total (e.g., all users in
all multi-paths) estimated signal.

The residual signal $r_{\mathrm{res}}^{(n)}[t]$ is supplied to a further summation process for
each user (e.g., 404) where the output from that user's raised-cosine pulse shaping process 312
is added to the $r_{\mathrm{res}}^{(n)}[t]$ signal as described in above equation (7), thus
determining the cleaned-up signal $r_l^{(n+1)}[t]$ for each user.

Next, as with the baseline implementation, the cleaned-up signal $r_l^{(n+1)}[t]$ for each user
is supplied to a rake receiver 320 (or reapplied to 112) for processing into the resultant
$y_l^{(n+1)}[m]$
detection statistics ready for processing by the symbol processor 120.
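
The equivalence between the residual-signal form and the baseline subtraction can be sketched
as follows: the residual is formed once, and each user's cleaned-up signal is obtained by
adding back only that user's own estimate. The function names and the toy real-valued waveforms
are assumptions for illustration.

```python
def residual_signal(received, estimates):
    """r_res[t] = r[t] - sum over all users of r_hat_k[t] (computed once)."""
    return [r - sum(est[t] for est in estimates) for t, r in enumerate(received)]

def cleaned_from_residual(residual, own_estimate):
    """r_l^(n+1)[t] = r_res[t] + r_hat_l[t] (one addition per user)."""
    return [res + own for res, own in zip(residual, own_estimate)]

estimates = [[1.0, -1.0, 1.0], [0.5, 0.5, -0.5], [-0.25, 0.25, 0.25]]
noise = [0.125, 0.0, -0.125]
received = [sum(est[t] for est in estimates) + noise[t] for t in range(3)]

residual = residual_signal(received, estimates)
cleaned_user0 = cleaned_from_residual(residual, estimates[0])

# Baseline form for comparison: subtract every other user's estimate directly.
baseline_user0 = [received[t] - estimates[1][t] - estimates[2][t] for t in range(3)]
```

Only the single residual stream and each user's own estimate need to be distributed, rather
than every user's estimate to every other user.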
One skilled in the art can recognize that both the bandwidth and computational complexity
are improved (i.e., lowered) for this implementation compared with the base-line implementation
described above. Specifically, continuing with the assumptions used in determining
the bandwidth and computational complexity as above and applying those assumptions to the
residual-signal implementation, the bandwidth can be estimated as follows:

3.84 Mchips / sec / antenna / stream
x 2 antennas
x 8 samples / chip
x 1 bytes / sample
x 129 streams
= 7.9 Gbytes / sec

The computational complexity for each of the processes is as follows: the re-spreading
and raised-cosine costs are the same as with the baseline implementation.

For the IC processes, the computational complexity is:

3.84 Mchips / sec / antenna / physical-user / multiple user detection stage
x 8 samples / chip
x 2 antennas
x 128 physical users
x 1 multiple user detection stage
x 2 ops / sample / waveform addition (complex add)
x 3 waveform additions
= 47.2 GOPS

Finally, the finger receiver processes are the same as with the base-line implementation
above. Therefore, summing the separate computational complexities for each of the above
processes yields the following results:

Process                     GOPS
Re-Spread                   31.5
Raised Cosine Filtering     1,132.5
IC                          47.2
Finger Receivers            62.9
TOTAL                       1,274.1
Therefore, both the bandwidth and computational complexity are improved; however, it
can be recognized by one skilled in the art that even with such improvement, the computational
complexity may be a limiting factor.

Further improvement is possible and is now described in the following three
embodiments, although other embodiments can be recognized by one skilled in the art. One
improvement is to utilize an implicit waveform subtraction rather than the explicit waveform
subtraction described for use with both the baseline implementation and the residual long-code
implementation above. A considerable reduction in computational complexity results if the
individual user waveforms are not explicitly calculated, but rather implicitly calculated.

The illustrated embodiment utilizes implicit waveform subtraction by expanding on
equation (7) above, and using approximations as shown below in equation (8).




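$$\begin{aligned}
y_l^{(n+1)}[m] &\equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r_l^{(n+1)}[rN_c + \hat{\tau}_{lq}^{(n)} + mT_l]\, c_{lm}^{*}[r] \right\} \\
&= \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} \hat{r}_l^{(n)}[rN_c + \hat{\tau}_{lq}^{(n)} + mT_l]\, c_{lm}^{*}[r] \right\} \\
&\quad + \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r_{\mathrm{res}}^{(n)}[rN_c + \hat{\tau}_{lq}^{(n)} + mT_l]\, c_{lm}^{*}[r] \right\} \\
&\approx A_l^{(n)2} \cdot \hat{b}_l^{(n)}[m] + \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r_{\mathrm{res}}^{(n)}[rN_c + \hat{\tau}_{lq}^{(n)} + mT_l]\, c_{lm}^{*}[r] \right\}, \\
A_l^{(n)2} &\equiv \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H}\, \hat{a}_{lq}^{(n)}
\end{aligned} \qquad (8)$$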
The two approximations used, as indicated within equation (8), include neglecting
inter-symbol interference terms for the user of interest, and further, neglecting cross-multi-path
interference terms for the user of interest. Because the user of interest term has a strong
deterministic term, the omission of these low-level random contributions is justified. These
contributions could be included in a more detailed embodiment without incurring excessive
increases in computational complexity. However, implementation computational complexity would
increase somewhat. Such an embodiment may be appropriate for high data-rate, low spreading
factor users where inter-symbol and cross multi-path terms are larger.

A noteworthy aspect of equation (8) above is that the rake receiver operation on the
estimated user of interest signal $\hat{r}_l^{(n)}[t]$ can be calculated analytically. Thus, the
signal need not be explicitly formed, but rather, the corresponding contribution is added after
the rake receiver operation on the residual signal alone. Now referring to Figure 5, this
implicit waveform subtraction implementation is illustrated.

One skilled in the art can glean from the illustration that separate re-spreading and
raised-cosine processing is no longer performed on each individual user signal, but rather, is
performed only once on the baseband composite re-spread signal $\rho^{(n)}[t]$. Thus, the
re-spread process 312 accumulates the composite signal $\rho^{(n)}[t]$ based on the amplitudes
$\hat{a}_{kp}^{(n)}$, time lags $\hat{\tau}_{kp}^{(n)}$ and user codes. The output from the
re-spreading process produces another composite signal $\hat{r}^{(n)}[t]$ 502 as described below
and in equation (9).

At this point, it is of note that a substantial reduction in computational complexity
accrues due to not having to explicitly calculate the individual user estimated waveforms. As
illustrated in Figure 5, the individual user waveforms are not required; hence, the composite
signal $\rho^{(n)}[t]$ 502 representing the sum of all estimated user waveforms can be formed by
calculating this composite waveform first without performing the raised-cosine filtering process
on each individual waveform. Only one filtering operation need be performed, which represents
a substantial reduction in computational complexity.
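
The saving rests on the linearity of the pulse-shaping filter: filtering the sum of the
re-spread user waveforms gives the same result as summing the individually filtered waveforms.
The sketch below checks this numerically; the three-tap filter is a placeholder rather than a
real raised-cosine response, and all names and values are assumptions for illustration.

```python
def fir_filter(signal, taps):
    """Causal FIR convolution with the output truncated to the input length."""
    return [
        sum(taps[r] * signal[t - r] for r in range(len(taps)) if t - r >= 0)
        for t in range(len(signal))
    ]

taps = [0.25, 0.5, 0.25]   # placeholder taps, not a real raised cosine
user_waveforms = [
    [1.0, 0.0, -1.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [-1.0, 0.0, 0.0, 1.0],
]

# Explicit approach: one filtering pass per user, then sum the results.
per_user = [fir_filter(w, taps) for w in user_waveforms]
explicit = [sum(column) for column in zip(*per_user)]

# Implicit approach: sum the unshaped waveforms first, then filter once.
composite = [sum(column) for column in zip(*user_waveforms)]
implicit = fir_filter(composite, taps)
```

The number of filtering passes drops from one per user to one in total, independent of the
number of users.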

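The form of $\rho^{(n)}[t]$ is as follows:

$$\begin{aligned}
\rho^{(n)}[t] &\equiv \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{r} \delta[t - \hat{\tau}_{kp}^{(n)} - rN_c]\, \hat{a}_{kp}^{(n)}\, c_k[r]\, \hat{b}_k^{(n)}[\lfloor r/N_k \rfloor] \\
\hat{r}^{(n)}[t] &= \sum_{r} g[r]\, \rho^{(n)}[t - r]
\end{aligned} \qquad (9)$$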
Now that an understanding of the composite waveform $\rho^{(n)}[t]$ is accomplished, referring
back to Figure 5, this waveform is transformed into $\hat{r}^{(n)}[t]$ via the raised-cosine
pulse shaping filter 312. From here, a summation process 506 subtracts $\hat{r}^{(n)}[t]$ from
the base-line waveform r[t], producing the residual waveform $r_{\mathrm{res}}^{(n)}[t]$ as shown
above (e.g., in equation (7)).

Unlike the residual signal implementation described above, here, the
$r_{\mathrm{res}}^{(n)}[t]$ is applied directly to the rake receivers 506 (or reapplied to the
rake receivers 112) for each user together with the user code for that user. The output from
each rake receiver is applied to a summation process, where the
$A_l^{(n)2} \cdot \hat{b}_l^{(n)}[m]$ values are added to the rake receiver output as described
above in equation (8), producing the $y_l^{(n+1)}[m]$ detection statistics suitable for symbol
processing 120.

The computational complexity of this embodiment is reduced as now described. The
re-spread processing and rake receiver computational costs are the same as with the previous
implementations. However, the raised-cosine filtering and interference cancellation
computational costs are now as follows.

For the raised-cosine filtering,

3.84 Mchips / sec / antenna / multiple user detection stage
x 8 samples / chip
x 2 antennas
x 1 multiple user detection stage
x 6 ops / sample / tap (complex addition then real x complex mac)
x 24 taps (using symmetry)
= 8.8 GOPS

The computational cost of the IC process is

3.84 Mchips / sec / antenna / multiple user detection stage
x 8 samples / chip
x 2 antennas
x 1 multiple user detection stage
x 2 ops / sample / waveform addition (complex add)
x 1 waveform addition
= 0.123 GOPS






WO 02/073937 



PCT/US02/08106 






Summing the separate computational complexities for each of the above processes
yields the following results:

Process                     GOPS
Re-Spread                   31.5
Raised Cosine Filtering     8.8
IC                          0.1
Finger Receivers            62.9
TOTAL                       103.3
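
As a hedged check of the figures for the implicit waveform subtraction embodiment, the sketch
below multiplies out the listed factors and compares the result with the baseline total of
3,240.2 GOPS stated earlier. The helper name and dictionary layout are assumptions; the totals
are formed from values rounded to one decimal place, as in the table.

```python
CHIP_RATE = 3.84e6  # chips per second

def gops(*factors):
    """Multiply the chip rate by the listed factors and express the result in GOPS."""
    ops = CHIP_RATE
    for factor in factors:
        ops *= factor
    return round(ops / 1e9, 1)

implicit_gops = {
    "Re-Spread": gops(2, 4, 256, 1, 4),
    "Raised Cosine Filtering": gops(8, 2, 1, 6, 24),   # one composite waveform
    "IC": gops(8, 2, 1, 2, 1),                          # one waveform addition
    "Finger Receivers": gops(2, 4, 256, 1, 8),
}
implicit_total = round(sum(implicit_gops.values()), 1)

baseline_total = 3240.2                                 # from the baseline table
reduction = baseline_total / implicit_total
```

The filtering and cancellation terms no longer scale with the number of users, so the
re-spread and finger receiver terms become the dominant costs.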

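Another embodiment using matched-filter outputs rather than antenna streams is now
presented. This embodiment follows from equation (8) above where the rake receiver outputs
are:

$$y_{\mathrm{res},l}^{(n)}[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r_{\mathrm{res}}^{(n)}[rN_c + \hat{\tau}_{lq}^{(n)} + mT_l]\, c_{lm}^{*}[r] \right\} \qquad (10)$$

and further using equation (7) above, equation (10) can be re-written as:

$$\begin{aligned}
y_{\mathrm{res},l}^{(n)}[m] &= \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r[rN_c + \hat{\tau}_{lq}^{(n)} + mT_l]\, c_{lm}^{*}[r] \right\} \\
&\quad - \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H} \cdot \frac{1}{2N_l} \sum_{r=0}^{N_l-1} \hat{r}^{(n)}[rN_c + \hat{\tau}_{lq}^{(n)} + mT_l]\, c_{lm}^{*}[r] \right\} \\
&= y_l[m] - y_{\mathrm{est},l}^{(n)}[m]
\end{aligned} \qquad (11)$$

where $y_l[m]$ is the matched-filter output of the first rake receivers and
$y_{\mathrm{est},l}^{(n)}[m]$ denotes the rake receiver output for the estimated composite signal
$\hat{r}^{(n)}[t]$, and then, combining equation (11) with equation (8) yields:

$$y_l^{(n+1)}[m] = A_l^{(n)2} \cdot \hat{b}_l^{(n)}[m] + y_l[m] - y_{\mathrm{est},l}^{(n)}[m], \qquad A_l^{(n)2} \equiv \sum_{q=1}^{L} \hat{a}_{lq}^{(n)H}\, \hat{a}_{lq}^{(n)} \qquad (12)$$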

This embodiment improves on the above approaches in that the antenna streams do not
need to be input into the multiple user detection process; however, it is not possible to
re-estimate the channel amplitudes.

Referring to Figure 6, the matched-filter output embodiment is illustrated. As illustrated,
the processing of the baseband r[t] waveform is accomplished as described in Figure 5 above,
and further, $\rho^{(n)}[t]$ is determined in accordance with equation (9)
and is applied to the raised-cosine pulse shaping process 602.

Differing from the above embodiment, however, there is no summation process before
applying $\hat{r}^{(n)}[t]$ to the second rake receiver process 604. Rather, $\hat{r}^{(n)}[t]$
is applied directly to the rake receiver process 604. The output from the rake receivers 604 is
subtracted 606 from the output $y_l^{(n)}[m]$ from the first rake receivers 112. This difference
is then added to the $A_l^{(n)2} \cdot \hat{b}_l^{(n)}[m]$ value to produce $y_l^{(n+1)}[m]$.
This process is described within the above equations
(11) and (12).



The computational complexity is reduced because there is no longer an explicit interference
canceling (IC) operation, and thus, the interference canceling computational cost is zero.
The rake receiver computational cost is half the previous embodiment's value because now the
re-estimate of the amplitudes cannot be performed, and there is no need to cancel interference
on the dedicated physical control channel (DPCCH). Therefore, the computational cost is:

Process                     GOPS
Re-Spread                   31.5
Raised Cosine Filtering     8.8
IC                          0.0
Finger Receivers            31.5
TOTAL                       71.8



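Another embodiment using matched-filter outputs obtained before the maximal ratio
combination (MRC) is now described. The pre-MRC rake matched-filter outputs can be
described as:

$$y_{lq}^{(0)}[m] \equiv \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r[rN_c + \hat{\tau}_{lq} + mT_l]\, c_{lm}^{*}[r] \qquad (13)$$

The same detection statistics based on the cleaned-up signal $r_l^{(1)}[t]$ are

$$y_{lq}^{(1)}[m] \equiv \frac{1}{2N_l} \sum_{r=0}^{N_l-1} r_l^{(1)}[rN_c + \hat{\tau}_{lq} + mT_l]\, c_{lm}^{*}[r] \qquad (14)$$

Now from Equation (7),

$$r_l^{(1)}[t] = r[t] - \hat{r}^{(0)}[t] + \hat{r}_l^{(0)}[t] \qquad (15)$$

Hence the first-stage pre-MRC matched-filter outputs can be re-written:

$$y_{lq}^{(1)}[m] = y_{lq}^{(0)}[m] + \hat{a}_{lq}^{(0)} \cdot \hat{b}_l^{(0)}[m] - \frac{1}{2N_l} \sum_{r=0}^{N_l-1} \hat{r}^{(0)}[rN_c + \hat{\tau}_{lq} + mT_l]\, c_{lm}^{*}[r] \qquad (16)$$

where the following approximation has been used,

$$\frac{1}{2N_l} \sum_{r=0}^{N_l-1} \hat{r}_l^{(0)}[rN_c + \hat{\tau}_{lq} + mT_l]\, c_{lm}^{*}[r] \approx \hat{a}_{lq}^{(0)} \cdot \hat{b}_l^{(0)}[m] \qquad (17)$$

Given the pre-MRC matched-filter outputs the re-estimated channel amplitudes are



wherein

$w[\cdot]$ is a filter,
$N_p$ is a number of symbols, and
M is a number of symbols per slot,
and the post-MRC matched-filter outputs are then

$$y_l^{(1)}[m] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{(1)H}\, y_{lq}^{(1)}[m] \right\} \qquad (19)$$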



This embodiment is illustrated in Figure 7. Here, the $y_{lq}^{(0)}[m]$ detection statistics
are produced as with the above embodiments; however, before being applied to the MRC, the
estimated amplitude $\hat{a}_{lq}^{(0)}$ is determined first. Next, the MRC produces the
$y_l^{(0)}[m]$ detection statistics, which are formed from the amplitudes and the
pre-combination matched-filter detection statistics $y_{lq}^{(0)}[m]$ as in Equation (19) above.

The r[t] waveform is applied (or reapplied) to a rake receiver 704. The output from the
rake receiver 704 is subtracted 706 from the $y_{lq}^{(0)}[m]$ detection statistics. Next, the
difference from the subtraction 706 is summed 708 with the
$\hat{a}_{lq}^{(0)} \hat{b}_l^{(0)}[m]$ value, thus producing $y_{lq}^{(1)}[m]$ in
accordance with equation (19) above.

After n iterations are performed, the $y_{lq}^{(n)}$ detection statistics for each of the users
corresponding to each antenna have been determined. The detection statistics for each user,
$y_l^{(n)}$, are next determined via estimating the complex amplitudes 710 across the Q channels
for that user, and performing a maximum ratio combination 712 using those amplitudes.

It is helpful to understand that although the computational complexity increased, here, 
it is possible to re-estimate channel amplitudes, and hence, cancel interference on the dedicated 
physical control channels (DPCCH). The computational complexity of this embodiment is: 



which is still within a practical range. 

Therefore, as shown in all the embodiments above, and other non-illustrated embodi- 
ments, methods for performing multiple user detection are illustrated. 

Turning now to software implementations for the above, one of several implementations
is designed to allow full or partial decoding of users at various transmission time intervals
(TTIs) within the multiple user detection (MUD) iterative loop. The approach, illustrated in
Figure 8, allows users belonging to different classes (e.g., voice and data) to be processed with
different latencies. For example, voice users could be processed with a 10+ ms latency 802,
whereas data users could be processed with an 80+ ms latency 804. Alternately, voice users could
be processed with a 20+ ms latency 806 or a 40+ ms latency 808, so as to include voice decoding
in the MUD loop. Other alternatives are possible depending on the implementation and
limitations of the processing requirements.

If a particular data user is to be processed with an 80+ ms latency 804 so as to include
the full turbo decode within the MUD loop, then the input channel bit-error rate (BER)
pertaining to these users might be extraordinarily high. Here, the MUD processing might be
configured so as to not include any cancellation of the data users within the 10+ ms latency 802.
These data users would then be cancelled in the 20+ ms latency 806 period. For this cancellation
it could be opted to perform MUD only on data users. The second latency range processing would
still benefit from the cancellation of the voice users in the first latency range (e.g., first
box).

Alternately, the second box 806 could perform cancellation on both voice and data
users. The reduced voice channel bit-error rate would not benefit the voice users, whose data
has already been shipped out to meet the latency requirement, but the reduced voice channel
BER would improve the cancellation of voice interference from the data users. In the case that
voice and data users are cancelled in the second box 806, another possible configuration would
be to arrange the boxes in parallel. Other reduced-latency configurations with mixed serial and
parallel arrangements of the processing boxes are also possible.

Depending on the arrangement chosen, the performance for each class of user will vary.
The approach above tends to balance the propagation range for data and voice users, and the
particular arrangement can be chosen to tailor range for the various voice and data services.

Each box is the same code but configured differently. The parameters that differ are:

N_FRAMES_RAKE_OUTPUT;
Decoding to be performed (e.g., repetition decoding, turbo decoding, and the like);
Classes of users to be cancelled;
Threshold parameters.

The pseudo code for the software implementation of one long-code multiple user detection
processing box is as follows:
Initialize
    Zero data
    Generate OVSF codes
    Generate raised cosine pulse
    Allocate memory
    Open rake output files
    Open mod output files
    Align mod data

Main Frame Loop {
    Determine number of physical users
    Read_in_rake_output_records (N frames)
    Reformat_rake_output_data (N frames at a time)
    for stage = 1 : N_stages
        Perform appropriate decoding (SRD, turbo, and the like, depending on TTI)
        Perform_long_code_mud
    end
}

Free memory

The following four functions are described below:

Read_in_rake_output_records;
Reformat_rake_output_data;
Perform appropriate decoding (SRD, turbo, and the like, depending on TTI);
Perform_long_code_mud.

The Read_in_rake_output_records function performs:
Reading in data for each user; and
Assigning data structure pointers.



The rake data transferred to MUD is associated with structures of type
Rake_output_data_type. The elements of this structure are given in Table 1. There is a
parameter N_FRAMES_RAKE_OUTPUT with values { 1, 2, 4, 8 } that specifies the number of frames
to be read in at a time. The following table tabulates the structure Rake_output_buf_type
elements:



40 



Element Type      Name
unsigned long     Frame_number
unsigned long     physical_user_code_number
int               physical_user_tfci
int               physical_user_sf
int               physical_user_beta_c
int               physical_user_beta_d
int               N_dpdchs
int               compressed_mode_flag
int               compressed_mode_frame
int               Nfirst
int               TGL
int               slot_format
int               N_rake_fingers
int               N_antennas
unsigned long     mpath_offset[N_ANTENNAS]
unsigned long     tau_offset
unsigned long     y_offset
COMPLEX*          mpath[N_ANTENNAS]
unsigned long*    tau_hat
float*            y_data



It is helpful to describe several structure elements for a complete understanding. The
element slot_format is an integer from 0 to 11 representing the row in the following table
(DPCCH fields), 3GPP TS 25.211. By way of non-limiting example, when slot_format = 3, it
maps to the fourth row in the table, corresponding to slot format 1 with 8 pilot bits and 2 TPC
bits. The offset values (e.g. tau_offset) give the location in memory relative to the top of the
structure where the corresponding data is stored. These offset values are used for setting the
corresponding pointers (e.g. tau_hat). For example, if Rbuf is a pointer to the structure then:

Rbuf->tau_hat = (unsigned long*)( (unsigned long)Rbuf + Rbuf->tau_offset );

is used to set the tau_hat pointer.
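By way of non-limiting illustration, the offset-to-pointer step can be sketched as follows. The structure rake_buf_sketch is a hypothetical, trimmed-down stand-in for the rake output structure, and the helper names are assumptions made for this sketch. The sketch also swaps the integer cast for byte-pointer arithmetic, which does not rely on unsigned long being as wide as a pointer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for the rake output structure. */
typedef struct {
    unsigned long tau_offset;   /* byte offset of the delay data from the top */
    unsigned long y_offset;     /* byte offset of the y-data from the top */
    unsigned long *tau_hat;     /* resolved from tau_offset */
    float *y_data;              /* resolved from y_offset */
} rake_buf_sketch;

/* Resolve the stored byte offsets into pointers inside the same memory block. */
static void rake_buf_set_pointers(rake_buf_sketch *rbuf)
{
    unsigned char *base = (unsigned char *)rbuf;
    rbuf->tau_hat = (unsigned long *)(base + rbuf->tau_offset);
    rbuf->y_data  = (float *)(base + rbuf->y_offset);
}

/* Usage: one record holding 8 delays followed by 4 floats; returns 1 on success. */
static int rake_buf_demo(void)
{
    size_t hdr = sizeof(rake_buf_sketch);
    size_t tau_bytes = 8 * sizeof(unsigned long);
    int ok;
    rake_buf_sketch *rbuf = malloc(hdr + tau_bytes + 4 * sizeof(float));
    if (rbuf == NULL)
        return 0;
    rbuf->tau_offset = (unsigned long)hdr;
    rbuf->y_offset = (unsigned long)(hdr + tau_bytes);
    rake_buf_set_pointers(rbuf);
    rbuf->tau_hat[0] = 42UL;
    rbuf->y_data[3] = 1.5f;
    ok = ((unsigned char *)rbuf->tau_hat == (unsigned char *)rbuf + hdr)
      && rbuf->tau_hat[0] == 42UL && rbuf->y_data[3] == 1.5f;
    free(rbuf);
    return ok;
}
```

Because the offsets are relative to the top of the structure, the same record can be read from a file or moved in memory and the pointers re-established with a single call.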



The rake output structure associated data (mpath, tau_hat and y_data) is ordered as follows:

mpath[n][q + s * L] = amplitude data
tau_hat[q] = delay data
y_data[ 0 + m * M ] = DPCCH data for symbol period m
y_data[ 1 + j + (d-1)*J + m * M ] = dth DPDCH data for symbol period m

where

n       = antenna index (0 : Na-1)
q       = finger index (0 : L-1)
s       = slot index (0 : Nslots-1)
m       = symbol index (0 : 149)
j       = bit index (0 : J-1)
d       = DPDCH index (1 : Ndpdchs)
Na      = N_ANTENNAS
L       = N_RAKE_FINGERS_MAX
Nslots  = N_SLOTS_PER_FRAME = 15
J       = 256/SF
M       = 1 + J*Ndpdchs.
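The y_data ordering above can be sketched as a pair of index helpers. The function names are assumptions made for this illustration; the arithmetic follows the definitions of J and M given above.

```c
/* M = 1 + J*Ndpdchs values per symbol period, with J = 256/SF. */
static int y_values_per_symbol(int sf, int n_dpdchs)
{
    return 1 + (256 / sf) * n_dpdchs;
}

/* Index of the DPCCH value for symbol period m. */
static int y_index_dpcch(int m, int sf, int n_dpdchs)
{
    return 0 + m * y_values_per_symbol(sf, n_dpdchs);
}

/* Index of bit j (0 : J-1) of the d-th DPDCH (1 : Ndpdchs) for symbol period m. */
static int y_index_dpdch(int d, int j, int m, int sf, int n_dpdchs)
{
    int J = 256 / sf;
    return 1 + j + (d - 1) * J + m * y_values_per_symbol(sf, n_dpdchs);
}
```

For SF 4 with 6 DPDCHs, the last bit of the last DPDCH in symbol period 149 lands at index 57749, the final element of a 150 x 385 float buffer.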



The memory required for the rake output buffers is dominated by the y-data memory
requirement. The maximum memory requirement for a user is Nsym * (1 + 64*6) floats per
frame, where Nsym = 150 is the number of symbols per frame. This corresponds to 1 DPCCH
at SF 256 and 6 DPDCHs at SF 4. If all 128 users were allocated this much memory, memory
problems could arise. To minimize allocation problems, the following table gives the
maximum number of users that the MUD implementation will be designed to handle at a given SF
and Ndpdchs.



SF     Ndpdchs   Number users   Bits per symbol   Mean bits per symbol
256    1         256            2                 4.0
128    1         192            3                 4.5
64     1         128            5                 5.0
32     1         96             9                 6.8
16     1         64             17                8.5
8      1         32             33                8.3
4      1         16             65                8.1
4      2         12             129               12.1
4      3         8              193               12.1
4      4         4              257               8.0
4      5         3              321               7.5
4      6         2              385               6.0



In the preceding table, Bits per symbol = 1 + ( 256 / SF ) * Ndpdchs, Mean bits
per symbol = (Number users) * (Bits per symbol) / 128, and Ndpdchs = Number DPDCHs.
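The two table formulas can be checked with a short sketch. The function names are assumptions made for this illustration; the divisor 128 is the one used in the table.

```c
/* Bits per symbol = 1 + (256 / SF) * Ndpdchs (one DPCCH bit plus the DPDCH bits). */
static int bits_per_symbol(int sf, int n_dpdchs)
{
    return 1 + (256 / sf) * n_dpdchs;
}

/* Mean bits per symbol = (Number users) * (Bits per symbol) / 128. */
static double mean_bits_per_symbol(int n_users, int sf, int n_dpdchs)
{
    return (double)n_users * (double)bits_per_symbol(sf, n_dpdchs) / 128.0;
}
```

For example, 64 users at SF 16 with one DPDCH give 17 bits per symbol and a mean of 8.5, matching the corresponding table row.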

From the above table it is noted that the parameter specifying the mean number of bits
per symbol should be set to MEAN_BITS_PER_SYMBOL = 16, which exceeds the largest tabulated
mean of 12.1. The code checks whether the physical user specifications are consistent with
this memory allocation. Given this specification, the following are estimates for the memory
required for the rake output buffers.



Data              Type        Size   Count                      Count    Bytes
Rake_output_buf   Structure   88     1                          1        88
mpath             COMPLEX     8      Lmax * Nslots * Na         240      1,920
tau               int         4      Lmax                       8        32
y                 float       4      Nsym * Nbits               2400     9,600
                  COMPLEX     8      Nsym * Nbits * Lmax * Na   38,400   307,200

Total bytes per user per frame             318,840
Total bytes for 128 users and 9 frames     367 Mbytes

where Count is per physical user per frame, assuming numeric values based on:

Lmax    = N_RAKE_FINGERS_MAX    = 8
Nslots  = N_SLOTS_PER_FRAME     = 15
Na      = N_ANTENNAS            = 2
Nsym    = N_SYMBOLS_PER_FRAME   = 150
Nbits   = MEAN_BITS_PER_SYMBOL  = 16
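The memory estimate above can be reproduced with the following sketch. The enum and function names are assumptions made for this illustration; the sizes and counts are those tabulated above.

```c
enum { LMAX = 8, NSLOTS = 15, NA = 2, NSYM = 150, NBITS = 16 };

/* Bytes needed for one physical user and one frame of rake output. */
static long rake_bytes_per_user_frame(void)
{
    long structure = 88;                             /* Rake_output_buf structure */
    long mpath = 8L * LMAX * NSLOTS * NA;            /* COMPLEX amplitudes: 1,920 */
    long tau = 4L * LMAX;                            /* delays: 32 */
    long y = 4L * NSYM * NBITS;                      /* float y-data: 9,600 */
    long y_complex = 8L * NSYM * NBITS * LMAX * NA;  /* COMPLEX row: 307,200 */
    return structure + mpath + tau + y + y_complex;
}

/* Total bytes for a given number of users and buffered frames. */
static long rake_bytes_total(long n_users, long n_frames)
{
    return n_users * n_frames * rake_bytes_per_user_frame();
}
```

With 128 users and 9 frames (the 8 frames read at a time plus frame 0), the total is 367,303,680 bytes, i.e. the 367 Mbytes quoted above.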

The location of each structure is stored in an array of pointers

Rake_output_buf[User + Frame_idx * N_USERS_MAX]

where Frame_idx varies from 0 to N_FRAMES_RAKE_OUTPUT inclusive. Frame 0
is initially set with zero data. After all frames are processed, the structure and data
corresponding to the last frame are copied back to frame 0 and N_FRAMES_RAKE_OUTPUT new
structures and data are read from the input source.
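A minimal sketch of the pointer-array indexing follows. The values N_USERS_MAX = 128 and N_FRAMES_RAKE_OUTPUT = 8 are assumptions chosen to match the memory estimate above, and the function names are hypothetical.

```c
enum { N_USERS_MAX = 128, N_FRAMES_RAKE_OUTPUT = 8 };

/* Position of a user's structure pointer for a given frame index (0 : Nf inclusive). */
static int rake_buf_slot(int user, int frame_idx)
{
    return user + frame_idx * N_USERS_MAX;
}

/* Number of pointers needed: frame 0 (initial conditions) plus Nf newly read frames. */
static int rake_buf_slot_count(void)
{
    return (N_FRAMES_RAKE_OUTPUT + 1) * N_USERS_MAX;
}
```

Frame index 0 holds the initial conditions, so the copy-back step moves the entries at frame index N_FRAMES_RAKE_OUTPUT to frame index 0 before the next read.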

The Reformat_rake_output_data function performs:

Combining of multi-path data across frame boundaries;
Determining the number of rake fingers for each MUD processing frame;
Filling virtual-user data structures;
Separating DPCHs into virtual users;
Determining chip and sub-chip delays for all fingers;
Determining the minimum SF and maximum number of DPDCHs for each user;
Reformatting user b-data to correspond to the minimum SF;
Reformatting rake data to be linear and contiguous in memory.

Interference cancellation is performed over MUD processing frames. Due to multi-path
and asynchronous users, the MUD processing frame will not correspond exactly with the user
frames. MUD processing frames, however, are defined so as to correspond as closely as
possible to user frames. It is preferable for MUD processing that the number of multi-path returns
be constant across MUD processing frames. The function of multi-path combining is to format
the multi-path data so that it appears constant to the long-code MUD processing function.
Each time after N = N_FRAMES_RAKE_OUTPUT frames of data are read from the input
source the combining function is called.

Figure 9 shows a hypothetical set of multi-path lags corresponding to several frames of
user data 902. Also shown are the corresponding MUD processing frames 904. Notice that
MUD processing frame k overlaps with user frames k-1 and k. For example, processing frame
1 906 overlaps with user frame 0 908, and further, overlaps with user frame 1 910. The MUD
processing frame is positioned so that this is true for all multi-paths of all users. A one-symbol
period corresponds to a round trip for a 10 km radius cell. Hence even large cells are typically
only a few symbols asynchronous.

The multi-path combining function determines all distinct delay lags from user frames
k-1 and k. Each of these lags is assigned as a distinct multi-path associated with MUD
processing frame k, even if some of the distinct lags are obviously the same finger displaced in delay
due to channel dynamics. The amplitude data for a finger that extends into a frame where the
finger wasn't present is set to zero. The illustrated thin lag-lines (e.g., 912) represent finger
amplitude data that is set to zero. After the tentative number of fingers is assessed in this way,
the total finger energy that falls within the MUD processing frame is assessed for each tentative
finger and the top N_RAKE_FINGERS_MAX fingers are assigned. In the assignment of
fingers the finger indices for fingers that were active in the previous MUD processing frame are
kept the same so as not to drop data.

The user SF and number of DPDCHs can change every frame. It is helpful for efficient
MUD processing that the user SF and number of DPDCHs be constant across MUD processing
frames. This function, Reformat_rake_output_data, formats the user b-data so that it appears
constant to the long-code MUD processing function. Each time after N =
N_FRAMES_RAKE_OUTPUT frames of data are read from the input source this function is called. The
function scans the N frames of rake output data and determines for each user the minimum SF
and maximum number of DPDCHs. Virtual users are assigned according to the maximum
number of DPCHs. If for a given frame the user has fewer DPCHs, the corresponding b-data and
a-data are set to zero.

Note that this also applies to the case where the number of DPDCHs is zero due to
inactive users, and also to the case where the number of DPCHs is zero due to compressed mode.
It is anticipated that the condition of multiple DPDCHs will not often arise due to the extreme
use of spectrum. If for a given frame the SF is greater than the minimum, the b-data is expanded
to correspond to the lower SF. That is, for example, if the minimum SF is 4, but over some
frames the SF is 8, then each SF-8 b-data bit is replicated twice so as to look like SF-4 data.
Before the maximal-ratio combining (MRC) operation the y-data corresponding to
expanded b-data is averaged to yield the proper SF-8 y-data.
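The expansion and averaging just described can be sketched as follows. The function names and the use of signed char for b-data and float for y-data are assumptions made for this illustration; factor is the ratio of the frame's SF to the minimum SF (2 in the SF-8 to SF-4 example).

```c
#include <stddef.h>

/* Replicate each b-data bit 'factor' times so it looks like lower-SF data. */
static void expand_b_data(const signed char *b, size_t n, int factor, signed char *out)
{
    for (size_t i = 0; i < n; ++i)
        for (int k = 0; k < factor; ++k)
            out[i * (size_t)factor + (size_t)k] = b[i];
}

/* Average each group of 'factor' y-values to recover y-data at the original SF. */
static void average_y_data(const float *y, size_t n_out, int factor, float *out)
{
    for (size_t i = 0; i < n_out; ++i) {
        float sum = 0.0f;
        for (int k = 0; k < factor; ++k)
            sum += y[i * (size_t)factor + (size_t)k];
        out[i] = sum / (float)factor;
    }
}

/* Usage: SF-8 bits expanded to SF-4, and SF-4 y-data averaged back; returns 1 on success. */
static int sf_expand_demo(void)
{
    const signed char b8[3] = { 1, -1, 1 };
    signed char b4[6];
    const float y4[4] = { 0.5f, 1.5f, -1.0f, -3.0f };
    float y8[2];
    expand_b_data(b8, 3, 2, b4);
    average_y_data(y4, 2, 2, y8);
    return b4[0] == 1 && b4[1] == 1 && b4[2] == -1 && b4[3] == -1
        && b4[4] == 1 && b4[5] == 1 && y8[0] == 1.0f && y8[1] == -2.0f;
}
```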

Figure 10 shows how rake output data is mapped to (virtual) user data structures. Each
small box (e.g., 1002) in the figure represents a slot's-worth of data. For DPCCH y-data or
b-data, for example, each box would represent 150 values. Data is mapped so as to be linear in
memory and contiguous frame to frame for each antenna and each finger. The reason for this
mapping is that data can easily be accessed by adjusting a pointer. A similar mapping is used
for other data except the amplitude data, where it would be imprudent to attempt to keep the
number of fingers constant over a time period of up to 8 frames. For the virtual-user code data
there are generally 38,400 data items per frame; and for the b-data and y-data there are
generally 150 x 256 / SF data items per frame.

Note that for pre-MRC y-data, the mapping is linear and contiguous in memory for
each antenna and each finger. Each DPCH is mapped to a separate virtual user data structure.
The initial conditions data (frame 0 1004) is initially filled with zero data (except for the codes).
After frame N data is written, this data is copied back to frame 0 1004, and the next frame of
data that is written is written to frame 1 1006. For all data types the 0-index points to the first
data item written to frame 0 1004. For example, the initial-condition b-data (frame 0) for an
SF 256 virtual user is indexed b[0], b[1], . . ., b[149], and the b-data corresponding to frame 1
is b[150], b[151], . . ., b[299].

Four indices are of interest: chip index, bit index, symbol index, and slot index. The
chip index r is always positive. All indices are related to the chip index. That is, for chip index
r we have

Chip index = r
Bit index = r / Nk
Symbol index = r / 256
Slot index = r / 2560

where Nk is the spreading factor for virtual user k.
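The index relations above reduce to integer divisions of the chip index, as in the following sketch. The function names are assumptions made for this illustration, and r is taken to be non-negative as stated above.

```c
/* Bit index for a virtual user with spreading factor nk. */
static long bit_index(long r, int nk)
{
    return r / nk;
}

/* Symbol index: 256 chips per symbol. */
static long symbol_index(long r)
{
    return r / 256;
}

/* Slot index: 2560 chips per slot. */
static long slot_index(long r)
{
    return r / 2560;
}
```

For an SF 256 virtual user, chip 38400 (the first chip after frame 0) maps to bit index 150, which agrees with the b[150] indexing of frame 1 described earlier.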

The elements for the (virtual) user data structures are given in the following table along
with the memory requirements.

Element Type    Name                                    Bytes              Bytes
int             Dpch_type                               4                  4
int             Sf                                      4                  4
int             log2Sf                                  4                  4
float           Beta                                    4                  4
int             Mrc_bit_idx                             4                  4
int             N_bits_per_dpch                         4                  4
int             N_rake_fingers[Nf]                      4*8                32
int             Chip_idx_rs[Lmax]                       4*8                32
int             Chip_idx_ds[Lmax]                       4*8                32
int             Delay_lag[Lmax]                         4*8                32
int             finger_idx_max_lag                      4                  4
int             Chip_delay[Lmax]                        4*8                32
int             Sub_chip_delay[Lmax]                    4*8                32
COMPLEX         axcode[Nf][Na][Lmax][Nslots * 2][4]     8*8*2*8*15*2*4     122880
COMPLEX         a_hat_ds[Nf][Na][Lmax][Nslots * 2]      8*8*2*8*15*2       30720
COMPLEX*        mf_ylq[Na][Lmax]                        4*2*8              64
COMPLEX*        mud_ylq[Na][Lmax]                       4*2*8              64
float*          mf_y_data                               4                  4
float*          mud_y_data                              4                  4
char*           mf_b_data                               4                  4
char*           mud_b_data                              4                  4
char*           mod_b_data                              4                  4
char            Code[Nchips*(1+Nf)]                     1*38400*9          345600
COMPLEX         mud_ylq_save[Na][Lmax]                  8*2*8              128
int             Mrc_bit_idx_save                        4                  4
float           Repetition_rate                         4                  4
COMPLEX(1,2)    mf_ylq[Na][Lmax][Nbits1 * (1+Nf)]       8*2*8*1200*9       1382400
COMPLEX(1,2)    mud_ylq[Na][Lmax][Nbits1 * (1+Nf)]      8*2*8*1200*9       1382400
float(1,2)      mf_y_data[Nbits1 * (1+Nf)]              4*1200*9           43200
float(1,2)      mud_y_data[Nbits1 * (1+Nf)]             4*1200*9           43200
char(1,2)       mf_b_data[Nbits1 * (1+Nf)]              1*1200*9           10800
char(1,2)       mud_b_data[Nbits1 * (1+Nf)]             1*1200*9           10800
char(1,2)       mod_b_data[Nbits1 * (1+Nf)]             1*1200*9           10800

Total                                                                      3,383,304
x 256 v-users                                                              866 Mbytes

OLD:
COMPLEX         Code[Nchips * 2]                        8*38400*2          614400

where the following notations are defined:

1 - Associated data, not explicitly part of structure
2 - Based on 8 bits per symbol on average

Lmax        = N_RAKE_FINGERS_MAX       = 8
Na          = N_ANTENNAS               = 2
Nslots      = N_SLOTS_PER_FRAME        = 15
(Nbitsmax1  = N_BITS_PER_FRAME_MAX_1   = 9600)
Nchips      = N_CHIPS_PER_FRAME        = 38400
Nf          = N_FRAMES_RAKE_OUTPUT     = 8
Nbits1      = MEAN_BITS_PER_FRAME_1    = 150*4.25 ~= 640.

Each user class has a specified decoding to be performed. The decoding can be:

None
Soft Repetition Decoding (SRD)
Turbo decoding
Convolutional decoding.

All decoding is Soft-Input Soft-Output (SISO) decoding. For example, an SF 64 voice
user produces 600 soft bits per frame. Thus 1,200 soft bits are produced per 20 ms transmission
time interval (TTI). These 1,200 soft bits are input to a SISO de-multiplex and convolutional
decoding function that outputs 1,200 soft bits. The SISO de-multiplex and convolutional
decoding function reduces the channel bit error rate (BER) and hence improves MUD
performance. Since data is linear in memory, no reformatting of data is necessary and the operation
can be performed in place. If further decoders are included, partial-decode variants can be
employed to reduce complexity. For turbo decoding, for example, the number
of iterations may be limited to a small number.

The Long-code MUD performs the following operations:
Respread
Raised-Cosine Filtering
Despread
Maximal-Ratio Combining (MRC).

The re-spread function calculates \rho[t] given by

\rho[t] = \sum_{k} \sum_{p} \sum_{r} \delta[t - \hat{\tau}_{kp} - r N_c] \cdot \hat{a}_{kp}[r/2560] \cdot c_k[r] \cdot b_k[r/N_k] \qquad (20)

The function \rho[t] is calculated over the interval t = 0 : Nf*M*Nc - 1, where M = 38400
is the number of chips per frame and Nf is the number of frames processed at a time. The actual
function calculated is

\rho_m[t] = \rho[t + m N_c N_{chips}], \qquad t = 0 : N_c N_{chips} - 1

which represents a section of the waveform of length Nchips chips, and the calculation
is performed for m = 0 : Nf*M*Nc / Nchips - 1. The function is defined (and allocated) for
negative indices -(Lg - 1) : -1, representing the initial conditions which are set to zero at
start-up. The parameter Lg is the length of the raised-cosine filter discussed below.

Note that every finger of every user adds one and only one non-zero contribution per
chip within this interval corresponding to chip indices r. Given the delay lag \tau_{lq} for the qth
finger of the lth user we can determine which chip indices r contribute to a given interval. To
this end define

t = n N_c + q, \qquad 0 \le q < N_c
\tau_{lq} = n_{lq} N_c + q_{lq}, \qquad 0 \le q_{lq} < N_c \qquad (22)

The first definition defines t as belonging to the nth chip interval; the second is a
decomposition of the delay lag into chip delay and sub-chip delay. Given the above we can solve for
r and q using

r = n - n_{lq}
q = q_{lq} \qquad (23)
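By way of non-limiting illustration, the chip and sub-chip decomposition can be sketched as below. The type chip_hit and the function name are assumptions made for this sketch; t and tau are sample indices taken to be non-negative, and nc is the number of samples per chip.

```c
/* Result of testing whether a finger contributes at sample t. */
typedef struct {
    int hit;   /* 1 if the finger has a non-zero contribution at sample t */
    long r;    /* chip index of that contribution (may be negative) */
} chip_hit;

/*
 * Decompose t = n*nc + q and tau = n_tau*nc + q_tau. The finger contributes
 * at sample t only when the sub-chip parts agree, and then r = n - n_tau.
 */
static chip_hit contributing_chip(long t, long tau, int nc)
{
    long n = t / nc, q = t % nc;
    long n_tau = tau / nc, q_tau = tau % nc;
    chip_hit h = { 0, 0 };
    if (q == q_tau) {
        h.hit = 1;
        h.r = n - n_tau;
    }
    return h;
}
```

With nc = 8, a finger delayed by 20 samples contributes chip 10 at sample 100, and contributes chip -2 at sample 4, which is why the initial-condition data for negative chip indices must be kept.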



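Notice that chip indices r as given above can be negative. In the implementation the
pointers \hat{a}_{kp}, c_k and b_k point to the first element of frame 1 1006 (Figure 10).

The repeated amplitude-code multiplies are avoided by using:

(a \cdot c)_{kp}[s][c] = \begin{cases} \hat{a}_{kp}[s] \cdot (+1+j), & c = 0 \\ \hat{a}_{kp}[s] \cdot (-1+j), & c = 1 \\ \hat{a}_{kp}[s] \cdot (-1-j), & c = 2 \\ \hat{a}_{kp}[s] \cdot (+1-j), & c = 3 \end{cases}
\qquad
c = \begin{cases} 0, & c_k[r] = +1+j \\ 1, & c_k[r] = -1+j \\ 2, & c_k[r] = -1-j \\ 3, & c_k[r] = +1-j \end{cases} \qquad (24)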

The raised-cosine filtering operation applied to the re-spread signal \rho[t] produces an
estimate of the received signal given by:

\hat{r}[t] = \sum_{t'} g[t'] \, \rho[t - t'] \qquad (25)

where g[t] is the raised-cosine pulse and

t = 0 : Nc*Nchips - 1
t' = 0 : Lg - 1
Lg = Nsamples-rc (length of raised-cosine filter)

For example, if an impulse at t = 0 is passed through the above filter the output is g[t].
The position of the maximum of the filter then specifies the delay through the filter. The delay is
relevant since it specifies the synchronization information necessary for subsequent
despreading. The raised cosine filter is calculated over the time period n = ( n1 : n2 ) / Nc, where Nc is
the number of samples per chip, and time is in chips. Note that n1 is negative, and the position
of the maximum of the filter is at n = 0. The length of the filter is then Lg = n2 - n1, and the
maximum occurs at sample -n1. The delay is thus -n1 samples, and the chip delay is -n1 / Nc
chips. For simplicity of implementation n1 is required to be a multiple of Nc.

The de-spread operation calculates the pre-MRC detection statistics corresponding to 
the estimate of the received signal: 



(26) 



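Prior to the MRC operation, the MUD pre-MRC detection statistics are calculated
according to:

y_{lq}^{\mathrm{MUD}}[m] = \hat{a}_{lq} \, b_l[m] + y_{lq}^{\mathrm{MF}}[m] - y_{lq}^{\mathrm{est}}[m] \qquad (27)

These are then combined with antenna amplitudes to form the post-MRC detection
statistics:

y_l^{\mathrm{MUD}}[m] = \sum_{q} \mathrm{Re}\{ \hat{a}_{lq}^{*} \, y_{lq}^{\mathrm{MUD}}[m] \} \qquad (28)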

Multiuser detection systems in accord with the foregoing embodiments can be
implemented in any variety of general or special purpose hardware and/or software devices. Figure
11 depicts one such implementation. In this embodiment, each frame of data is processed three
times by the MUD processing card 118 (or, "MUD processor" for short), although it can be
recognized that multiple such cards could be employed instead (or in addition) for this purpose.
During the first pass, only the control channels are respread while the maximal-ratio
combining (MRC) and MUD processing are performed on the data channels. During subsequent
passes, data channels are processed exclusively, with new y (i.e., soft decisions) and b (i.e.,
hard decisions) data being generated as shown in the diagram.

Amplitude ratios and amplitudes are determined via the DSP (e.g., element 900, or a
DSP otherwise coupled with the processor board 118 and receiver 110), as well as certain
waveform statistics. These values (e.g., matrices and vectors) are used by the MUD processor
in various ways. The MUD processor is decomposed into four stages that closely match the
structure of the software simulation: Alpha Calculation and Respread 1302, raised-cosine
filtering 1304, de-spreading 1306, and MRC 1308. Each pass through the MUD processor is
equivalent to one processing stage of the implementations discussed above. The design is
pipelined and "parallelized." In the illustrated embodiment, the clock speed can be 132 MHz,
resulting in a throughput of 2.33 ms/frame; however, the clock rate and throughput vary
depending on the requirements. The illustrated embodiment allows for three-pass MUD
processing with additional overhead from external processing, resulting in a 4-times real-time
processing throughput.

The alpha calculation and respread operations 1302 are carried out by a set of thirty-two
processing elements arranged in parallel. These can be processing elements within an ASIC,
FPGA, PLD or other such device, for example. Each processing element processes two users
of four fingers each. Values for b are stored in a double-buffered lookup table. Values of a and
ja are pre-multiplied with beta by an external processor and stored in a quad-buffered lookup
table. The alpha calculation stage generates the following values for each finger, where
subscripts indicate antenna identifier:

\alpha_0 = \beta_0 \cdot (C \cdot a_0 - jC \cdot ja_0)
j\alpha_0 = \beta_0 \cdot (jC \cdot a_0 + C \cdot ja_0)
\alpha_1 = \beta_1 \cdot (C \cdot a_1 - jC \cdot ja_1)
j\alpha_1 = \beta_1 \cdot (jC \cdot a_1 + C \cdot ja_1)
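The four values above have the form of a complex multiply of the code by the amplitude, scaled by beta, with the real and imaginary parts held separately. A minimal sketch under that reading follows; the function names are assumptions made for this illustration, and the c_im and a_im arguments stand for the quantities written jC and ja above.

```c
/* Real part: beta * (C*a - jC*ja). */
static float alpha_re(float beta, float c_re, float c_im, float a_re, float a_im)
{
    return beta * (c_re * a_re - c_im * a_im);
}

/* Imaginary part: beta * (jC*a + C*ja). */
static float alpha_im(float beta, float c_re, float c_im, float a_re, float a_im)
{
    return beta * (c_im * a_re + c_re * a_im);
}
```

Because the code chips take only the values +/-1 in each component, a hardware version of this step needs sign changes and additions rather than general multipliers for the code product.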

These values are accumulated during the serial processing cycle into four independent
8-times oversampling buffers. There are eight memory elements in each buffer and the
element used is determined by the sub-chip delay setting for each finger.

Once eight fingers have been accumulated into the oversampling buffer, the data is
passed into a set of four independent adder-trees. These adder-trees each terminate in a single
output, completing the respread operation. The four raised-cosine filters 1304 convolve the
alpha data with a set of weights determined by the following equation:
The filters can be implemented with 97 taps with odd symmetry. The filters illustrated
run at 8-times the chip rate; however, other rates are possible. The filters can be implemented
in a variety of compute elements 220, or other devices such as ASICs or FPGAs, for example.

The despread function 1306 can be performed by a set of thirty-two processing
elements arranged in parallel. Each processing element serially processes two users of four
fingers each. For each finger, one chip value out of eight, selected based on the sub-chip delay, is
accepted from the output of the raised-cosine filter. The despread stage performs the following
calculations for each finger (subscripts indicate antenna):

y_0 = \sum_{0}^{SF-1} (C \cdot r_0 + jC \cdot jr_0)
jy_0 = \sum_{0}^{SF-1} (C \cdot jr_0 - jC \cdot r_0)
y_1 = \sum_{0}^{SF-1} (C \cdot r_1 + jC \cdot jr_1)
jy_1 = \sum_{0}^{SF-1} (C \cdot jr_1 - jC \cdot r_1)

The MRC operations are carried out by a set of four processing elements arranged in
parallel, such as the compute elements 220 for example. Each processor is capable of serially
processing eight users of four fingers each. Values for y are stored in a double-buffered lookup
table. Values for b are derived from the MSB of the y data. Note that the b data used in the
MUD stage is independent of the b data used in the respread stage. Values of a and ja are
pre-multiplied with \beta by an external processor and stored in a quad-buffered lookup table.
Also, \sum (a^2 + ja^2) for each channel is stored in a quad-buffered table.

The output stage contains a set of sequential destination buffer pointers for each
channel. The data generated by each channel, on a slot basis, is transferred to the crossbar (or other
interconnect) destination indicated by these buffers. The first word of each of these transfers
will contain a counter in the lower sixteen bits indicating how many y values were generated.
The upper sixteen bits will contain the constant value 0xAA55. This will allow the DSP to
avoid interrupts by scanning the first word of each buffer. In addition, the DSP_UPDATE
register contains a pointer to a single crossbar location. Each time a slot of channel data is
transmitted, an internal counter is written to this location. The counter is limited to 10 bits and will wrap
around with a terminal count value of 1023.
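A minimal sketch of how a DSP-side routine might interpret the first word of a transfer and track the 10-bit counter follows. The macro and function names are assumptions made for this illustration; only the 0xAA55 marker, the 16-bit count field and the 10-bit wrap come from the description above.

```c
#include <stdint.h>

#define MUD_MARKER 0xAA55u   /* constant carried in the upper sixteen bits */

/* Build the first word of a transfer from the number of y values generated. */
static uint32_t mud_header_pack(uint16_t count)
{
    return ((uint32_t)MUD_MARKER << 16) | (uint32_t)count;
}

/* A buffer is ready when its first word carries the marker. */
static int mud_header_valid(uint32_t word)
{
    return (word >> 16) == MUD_MARKER;
}

/* Number of y values reported in the lower sixteen bits. */
static uint16_t mud_header_count(uint32_t word)
{
    return (uint16_t)(word & 0xFFFFu);
}

/* 10-bit update counter: wraps to 0 after the terminal count of 1023. */
static uint16_t mud_counter_next(uint16_t counter)
{
    return (uint16_t)((counter + 1u) & 0x3FFu);
}
```

Polling code can clear the first word of a buffer after reading it, so that a later scan distinguishes a fresh transfer from one already consumed; that clearing step is a design suggestion, not something stated in the description.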

The method of operation for the long-code multiple user detection algorithm (LCMUD)
is as follows. Spread-factor-4 (SF4) channels require a significant amount of data transfer. In
order to limit the gate count of the hardware implementation, processing an SF4 channel can
result in reduced capability.

An SF4 user can be processed on certain hardware channels. When one of these special
channels is operating on an SF4 user, the next three channels are disabled and are therefore
unavailable for processing. This relationship is as shown in the following table:

SF4 Chan   Disabled Channels   SF4 Chan   Disabled Channels
0          1, 2, 3             32         33, 34, 35
4          5, 6, 7             36         37, 38, 39
8          9, 10, 11           40         41, 42, 43
12         13, 14, 15          44         45, 46, 47
16         17, 18, 19          48         49, 50, 51
20         21, 22, 23          52         53, 54, 55
24         25, 26, 27          56         57, 58, 59
28         29, 30, 31          60         61, 62, 63
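The table follows a simple rule: every fourth channel can host an SF4 user, and doing so disables the three channels that follow it. A minimal sketch follows, assuming 64 hardware channels numbered 0 to 63 as in the table; the function names are hypothetical.

```c
/* Channels 0, 4, 8, ..., 60 can host an SF4 user. */
static int sf4_capable(int ch)
{
    return ch >= 0 && ch < 64 && ch % 4 == 0;
}

/* True when channel ch is disabled because sf4_ch is operating on an SF4 user. */
static int disabled_by_sf4(int ch, int sf4_ch)
{
    return sf4_capable(sf4_ch) && ch > sf4_ch && ch <= sf4_ch + 3;
}
```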



The default y and b data buffers do not contain enough space for SF4 data. When a
channel is operating on SF4 data, the y and b buffers extend into the space of the next channel
in sequence. For example, if channel 0 is processing SF4 data, the channel 0 and channel 1 b
buffers are merged into a single large buffer of 0x40 32-bit words. The y buffers are merged
similarly.

In typical operation, the first pass of the LCMUD algorithm will respread the control
channels in order to remove control interference. For this pass, the b data for the control
channels should be loaded into BLUT while the y data for data channels should be loaded into
YDEC. Each channel should be configured to operate at the spread factor of the data channel
stored into the YDEC table.

Control channels are always operated at SF 256, so it is likely that the control data will
need to be replicated to match the data channel spread factor. For example, each bit (b entry)
of control data would be replicated 64 times if that control channel were associated with an SF
4 data channel.

Each finger in a channel arrives at the receiver with a different delay. During the
Respread operation, this skew among the fingers is recreated. During the MRC stage of MUD
processing, it is necessary to remove this skew and realign the fingers of each channel. This is
accomplished in the MUD processor by determining the first bit available from the most
delayed finger and discarding all previous bits from all other fingers. The number of bits to
discard can be individually programmed for each finger with the Discard field of the
MUDPARAM registers. This operation will typically result in a 'short' first slot of data. This is
unavoidable when the MUD processor is first initialized and should not create any significant
problems. The entire first slot of data can be completely discarded if 'short' slots are
undesirable.

A similar situation will arise each time processing is begun on a frame of data. To avoid
losing data, it is recommended that a partial slot of data from the previous frame be overlapped
with the new frame. Trimming any redundant bits created this way can be accomplished with
the Discard register setting or in the system DSP. In order to limit memory requirements, the
LCMUD FPGA processes one slot of data at a time. Double buffering is used for b and y data
so that processing can continue as data is streamed in. Filling these buffers is complicated by
the skew that exists among fingers in a channel.

Figure 12 illustrates the skew relationship among fingers in a channel and among the
channels themselves. The illustrated embodiment allows for 20 us (77.8 chips) of skew among
fingers in a channel and certain skew among channels; however, in other embodiments these
skew allowances vary.

There are three related problems that are introduced by skew: identifying frame & slot
boundaries, populating b and y tables, and changing channel constants. Because every finger of
every channel can arrive at a different time, there are no universal frame and slot boundaries.
The DSP must select an arbitrary reference point. The data stored in b & y tables is likely to
come from two adjacent slots.

Because skew exists among fingers in a channel, it is not enough to populate the b & y
tables with 2,560 sequential chips of data. There must be some data overlap between buffers to
allow lagging channels to access "old" data. The amount of overlap can be calculated
dynamically or fixed at some number greater than 78 and divisible by four (e.g. 80 chips). The starting
point for each register is determined by the Chip Advance field of the MUDPARAM register.

A related problem is created by the significant skew among channels. As can be seen in
Figure 12, Channel 0 is receiving Slot 0 while Channel 1 is receiving Slot 2. The DSP must take
this skew into account when generating the b and y tables and temporally align channel data.
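The buffer overlap described above can be sketched as rounding the worst-case finger skew up to a multiple of four chips. The function names are hypothetical, and the assumption that each buffer then spans 2,560 chips plus the overlap is made only for this illustration.

```c
/* Round the worst-case finger skew (in whole chips) up to a multiple of four. */
static int overlap_chips(int max_skew_chips)
{
    return ((max_skew_chips + 3) / 4) * 4;
}

/* Assumed buffer span: one slot of 2,560 chips plus the overlap. */
static int buffer_span_chips(int max_skew_chips)
{
    return 2560 + overlap_chips(max_skew_chips);
}
```

For a skew allowance rounded up to 78 chips this gives an overlap of 80 chips, the example value quoted above.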

Selecting an arbitrary "slot" of data from a channel implies that channel constants tied 
to the physical slot boundaries may change while processing the arbitrary slot. The Constant 
Advance field of the MUDPARAM register is used to indicate when these constants should 
change. Registers affected this way are quad-buffered. Before data processing begins, at least 
two of these buffers should be initialized. During normal operation, one additional buffer is 
initialized for each slot processed. This system guarantees that valid constants data will always 
be available. 

The following two tables show the long-code MUD FPGA memory map and control/
status register:

Start Addr   End Addr    Name         Description
0000_0000    0000_0000   CSR          Control & Status Register
0000_0008    0000_000C   DSP_UPDATE   Route & Address for DSP updating
0001_0000    0001_FFFF   MUDPARAM     MUD Parameters
0002_0000    0002_FFFF   CODE         Spreading Codes
0003_0000    0004_FFFF   BLUT         Respread: b Lookup Table
0005_0000    0005_FFFF   BETA_A       Respread: Beta * a_hat Lookup Table
0006_0000    0007_FFFF   YDEC         MUD & MRC: y Lookup Table
0008_0000    0008_FFFF   ASQ          MUD & MRC: Sum a_hat squared LUT
000A_0000    000A_FFFF   OUTPUT       Output Routes & Addresses



Bit     31 - 16
Name    Reserved
R/W     RO
Reset   X

Bit     15 - 9     8    7 - 6   5    4    3    2    1     0
Name    Reserved   YB   CBUF    A1   A0   R1   R0   Lst   Rst
R/W     RO         RO   RO      RO   RO   RW   RW   RW    RW
Reset   X          0    0 0     0    0    0    0    0     0



The register YB indicates which of two y and b buffers are in use. If the system is
currently not processing, YB indicates the buffer that will be used when processing is initiated.
CBUF indicates which of four round-robin buffers for MUD constants (a_hat, beta) is
currently in use. Finger skew will result in some fingers using a buffer one in advance of this
indicator. To guarantee that valid data is always available, two full buffers should be initialized
before operation begins. If the system is currently not processing, CBUF indicates the buffer
that will be used when processing is restarted. It is technically possible to indicate precisely
which buffer is in use for each finger in both the Respread and Despread processing stages.
However, this would require thirty-two 32-bit registers. Implementing these registers would be
costly, and the information is of little value.

A1 and A0 indicate which y and b buffers are currently being processed. A1 and A0 will
never indicate '1' at the same time. An indication of '0' for both A1 and A0 means that the MUD
processor is idle.

R1 and R0 are writable fields that indicate to the MUD processor that data is available.
R1 corresponds to y and b buffer 1 and R0 corresponds to y and b buffer 0. Writing a '1' into
the correct field will initiate MUD processing. Note that these buffers follow strict round-robin
ordering; the YB field indicates which buffer should be activated next. These fields will be
automatically reset to '0' by the MUD hardware once processing is completed. It is not possible
for the external processor to force a '0' into these fields.

A '1' in the Lst bit indicates that this is the last slot of data in a frame. Once all available
data for the slot has been processed, the output buffers will be flushed.

A '1' in the Rst bit will place the MUD processor into a reset state. The external processor
must manually bring the MUD processor out of reset by writing a '0' into this bit.
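
The field layout described above can be illustrated with a short sketch of how an external processor might decode the register word. The bit positions are those of the register table (YB in bit 8, CBUF in bits 7-6, A1, A0, R1, R0, Lst and Rst in bits 5 through 0); the names MudStatus, decode_status and request_processing are hypothetical helpers introduced only for this illustration and are not part of the described hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MudStatus:
    """Decoded view of the register word (hypothetical helper type)."""
    yb: int    # bit 8: y/b buffer in use, or next to be used when idle
    cbuf: int  # bits 7-6: MUD constants buffer in use (0-3)
    a1: int    # bit 5: y/b buffer 1 is being processed
    a0: int    # bit 4: y/b buffer 0 is being processed
    r1: int    # bit 3: data available in y/b buffer 1
    r0: int    # bit 2: data available in y/b buffer 0
    lst: int   # bit 1: last slot of data in a frame
    rst: int   # bit 0: MUD processor held in reset

    @property
    def idle(self) -> bool:
        # '0' in both A1 and A0 means the MUD processor is idle.
        return self.a1 == 0 and self.a0 == 0


def decode_status(word: int) -> MudStatus:
    status = MudStatus(
        yb=(word >> 8) & 0x1,
        cbuf=(word >> 6) & 0x3,
        a1=(word >> 5) & 0x1,
        a0=(word >> 4) & 0x1,
        r1=(word >> 3) & 0x1,
        r0=(word >> 2) & 0x1,
        lst=(word >> 1) & 0x1,
        rst=word & 0x1,
    )
    if status.a1 and status.a0:
        # The text states that A1 and A0 never indicate '1' together.
        raise ValueError("A1 and A0 both set")
    return status


def request_processing(word: int, buffer: int) -> int:
    """Return the word with R1 (buffer 1) or R0 (buffer 0) set to '1'."""
    if buffer not in (0, 1):
        raise ValueError("buffer must be 0 or 1")
    return word | (1 << (3 if buffer == 1 else 2))
```

Because the text requires strict round-robin ordering, a caller of such a helper would normally read YB first and request the buffer it names.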

DSP_UPDATE is arranged as two 32-bit registers. A RACEway™ route to the MUD
DSP is stored at address 0x0000_0008. A pointer to a status memory buffer is located at address
0x0000_000C. Each time the MUD processor writes a slot of channel data to a completion
buffer, an incrementing count value is written to this address. The counter is fixed at 10 bits and
will wrap around after a terminal count of 1023.
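
A reader of the status buffer has to allow for the 10-bit wrap-around when comparing two count values. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the assumption that a consumer polls the count and takes differences is ours, not something the text specifies.

```python
COUNT_BITS = 10
COUNT_MASK = (1 << COUNT_BITS) - 1  # 1023, the terminal count


def next_count(count: int) -> int:
    """Value written after one more slot: wraps from 1023 back to 0."""
    return (count + 1) & COUNT_MASK


def slots_completed(previous: int, current: int) -> int:
    """Slots written between two reads, valid for fewer than 1024 slots."""
    return (current - previous) & COUNT_MASK
```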

A quad-buffered version of the MUD parameter control register exists for each finger to
be processed. Execution begins with buffer 0 and continues in round-robin fashion. These
buffers are used in synchronization with the MUD constants (beta * a-hat, etc.) buffers. Each
finger is provided with an independent register to allow independent switching of constant
values at slot and frame boundaries. The following table shows offsets for each MUD channel:



Offset  User    Offset  User    Offset  User    Offset  User
0x0000  0       0x0400  16      0x0800  32      0x0C00  48
0x0040  1       0x0440  17      0x0840  33      0x0C40  49
0x0080  2       0x0480  18      0x0880  34      0x0C80  50
0x00C0  3       0x04C0  19      0x08C0  35      0x0CC0  51
0x0100  4       0x0500  20      0x0900  36      0x0D00  52
0x0140  5       0x0540  21      0x0940  37      0x0D40  53
0x0180  6       0x0580  22      0x0980  38      0x0D80  54
0x01C0  7       0x05C0  23      0x09C0  39      0x0DC0  55
0x0200  8       0x0600  24      0x0A00  40      0x0E00  56
0x0240  9       0x0640  25      0x0A40  41      0x0E40  57
0x0280  10      0x0680  26      0x0A80  42      0x0E80  58
0x02C0  11      0x06C0  27      0x0AC0  43      0x0EC0  59
0x0300  12      0x0700  28      0x0B00  44      0x0F00  60
0x0340  13      0x0740  29      0x0B40  45      0x0F40  61
0x0380  14      0x0780  30      0x0B80  46      0x0F80  62
0x03C0  15      0x07C0  31      0x0BC0  47      0x0FC0  63



The following table shows buffer offsets within each channel:

Offset  Finger  Buffer
0x0000  0       0
0x0004  0       1
0x0008  0       2
0x000C  0       3
0x0010  1       0
0x0014  1       1
0x0018  1       2
0x001C  1       3
0x0020  2       0
0x0024  2       1
0x0028  2       2
0x002C  2       3
0x0030  3       0
0x0034  3       1
0x0038  3       2
0x003C  3       3
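
The two offset tables above follow a regular pattern: channels are spaced 0x40 bytes apart, fingers 0x10 bytes apart, and the four round-robin buffers 0x04 bytes apart. A minimal sketch of the resulting address arithmetic follows; the function name and the range checks are illustrative assumptions rather than anything defined in the text.

```python
CHANNEL_STRIDE = 0x40  # per-user (channel) spacing from the first table
FINGER_STRIDE = 0x10   # per-finger spacing from the second table
BUFFER_STRIDE = 0x04   # one 32-bit control register per buffer


def control_register_offset(user: int, finger: int, buffer: int) -> int:
    """Byte offset of the control register for (user, finger, buffer)."""
    if not 0 <= user < 64:
        raise ValueError("user must be in 0..63")
    if not 0 <= finger < 4:
        raise ValueError("finger must be in 0..3")
    if not 0 <= buffer < 4:
        raise ValueError("buffer must be in 0..3")
    return (user * CHANNEL_STRIDE
            + finger * FINGER_STRIDE
            + buffer * BUFFER_STRIDE)
```

For example, user 17 starts at offset 0x0440, so finger 2, buffer 1 of that user would sit at 0x0464 under this reading of the tables.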

The following table shows details of the control register:

Bits    Fields (most significant first)           R/W   Reset
31-16   Spread Factor, Subchip Delay, Discard     RW    X
15-0    Chip Advance, Constant Advance            RW    X






The Spread Factor field determines how many chip samples are used to generate a data
bit. In the illustrated embodiment, all fingers in a channel have the same spread factor setting;
however, one skilled in the art will appreciate that the setting can be variable in other
embodiments. The spread factor is encoded into a 3-bit value as shown in the following table:



SF Field  Spread Factor
000       256
001       128
010       64
011       32
100       16
101       8
110       4
111       RESERVED
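
The encoding in the table above halves the spread factor for each increment of the 3-bit field, so it can be decoded with a single shift. The sketch below illustrates this; the function names are hypothetical, and raising an error for the reserved code 111 is an assumption about how software might choose to treat it.

```python
RESERVED_CODE = 0b111


def decode_spread_factor(code: int) -> int:
    """Map the 3-bit SF field to the spread factor (256 down to 4)."""
    if not 0 <= code <= 0b111:
        raise ValueError("SF field is 3 bits wide")
    if code == RESERVED_CODE:
        raise ValueError("SF field value 111 is reserved")
    return 256 >> code


def encode_spread_factor(spread_factor: int) -> int:
    """Inverse mapping; only the seven table entries are accepted."""
    for code in range(RESERVED_CODE):
        if 256 >> code == spread_factor:
            return code
    raise ValueError("unsupported spread factor")
```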



The Subchip Delay field specifies the sub-chip delay for the finger. It is used to select
one of eight accumulation buffers prior to summing all Alpha values and passing them into the
raised-cosine filter.

Discard determines how many MUD-processed soft decisions (y values) to discard at the
start of processing. This is done so that the first y value from each finger corresponds to the
same bit. After the first slot of data is processed, the Discard field should be set to zero.

The behavior of the Discard field differs from that of other register fields. Once a
non-zero discard setting is detected, any new discard settings from switching to a new table
entry are ignored until the current discard count reaches zero. After the count reaches zero, a
new discard setting may be loaded the next time a new table entry is accessed.
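
The latching behavior of the Discard field can be modeled in a few lines. This is only a behavioral sketch under the reading given above (a pending non-zero count blocks new settings until it reaches zero); the class and method names are hypothetical and do not describe the actual hardware implementation.

```python
class DiscardCounter:
    """Behavioral model of the Discard field for one finger."""

    def __init__(self) -> None:
        self.remaining = 0  # y values still to be discarded

    def load(self, setting: int) -> None:
        """Called whenever a new table entry is accessed.

        The new setting is ignored while a previous count is pending.
        """
        if self.remaining == 0:
            self.remaining = setting

    def keep(self) -> bool:
        """Called once per soft decision; True if the y value is kept."""
        if self.remaining > 0:
            self.remaining -= 1
            return False
        return True
```

In this model, loading a setting of 2 and then a setting of 5 before any soft decision arrives discards only two y values, because the second setting is ignored.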
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All fingers within a channel will arrive at the receiver with different delays. Chip 
Advance is used to recreate this signal skew during the Respread operation. Y and b buffers are
arranged with older data occupying lower memory addresses. Therefore, the finger with the
earliest arrival time has the highest value of Chip Advance. Chip Advance need not be a
multiple of Spread Factor.

Constant Advance indicates on which chip this finger should switch to a new set of
constants (e.g. a-hat) and a new control register setting. Note that the new values take effect on
the chip after the value stored here. For example, a value of 0x0 would cause the new constants
to take effect on chip 1. A value of 0xFF would cause the new constants to take effect on chip 0
of the next slot.

The b lookup tables are arranged as shown in the following table. B values each occupy
two bits of memory, although only the LSB is utilized by LCMUD hardware.



Offset  Buffer    Offset  Buffer    Offset  Buffer    Offset  Buffer
0x0000  U0 B0     0x0400  U16 B0    0x0800  U32 B0    0x0C00  U48 B0
0x0020  U1 B0     0x0420  U17 B0    0x0820  U33 B0    0x0C20  U49 B0
0x0040  U0 B1     0x0440  U16 B1    0x0840  U32 B1    0x0C40  U48 B1
0x0060  U1 B1     0x0460  U17 B1    0x0860  U33 B1    0x0C60  U49 B1
0x0080  U2 B0     0x0480  U18 B0    0x0880  U34 B0    0x0C80  U50 B0
0x00A0  U3 B0     0x04A0  U19 B0    0x08A0  U35 B0    0x0CA0  U51 B0
0x00C0  U2 B1     0x04C0  U18 B1    0x08C0  U34 B1    0x0CC0  U50 B1
0x00E0  U3 B1     0x04E0  U19 B1    0x08E0  U35 B1    0x0CE0  U51 B1
0x0100  U4 B0     0x0500  U20 B0    0x0900  U36 B0    0x0D00  U52 B0
0x0120  U5 B0     0x0520  U21 B0    0x0920  U37 B0    0x0D20  U53 B0
0x0140  U4 B1     0x0540  U20 B1    0x0940  U36 B1    0x0D40  U52 B1
0x0160  U5 B1     0x0560  U21 B1    0x0960  U37 B1    0x0D60  U53 B1
0x0180  U6 B0     0x0580  U22 B0    0x0980  U38 B0    0x0D80  U54 B0
0x01A0  U7 B0     0x05A0  U23 B0    0x09A0  U39 B0    0x0DA0  U55 B0
0x01C0  U6 B1     0x05C0  U22 B1    0x09C0  U38 B1    0x0DC0  U54 B1
0x01E0  U7 B1     0x05E0  U23 B1    0x09E0  U39 B1    0x0DE0  U55 B1
0x0200  U8 B0     0x0600  U24 B0    0x0A00  U40 B0    0x0E00  U56 B0
0x0220  U9 B0     0x0620  U25 B0    0x0A20  U41 B0    0x0E20  U57 B0
0x0240  U8 B1     0x0640  U24 B1    0x0A40  U40 B1    0x0E40  U56 B1
0x0260  U9 B1     0x0660  U25 B1    0x0A60  U41 B1    0x0E60  U57 B1
0x0280  U10 B0    0x0680  U26 B0    0x0A80  U42 B0    0x0E80  U58 B0
0x02A0  U11 B0    0x06A0  U27 B0    0x0AA0  U43 B0    0x0EA0  U59 B0
0x02C0  U10 B1    0x06C0  U26 B1    0x0AC0  U42 B1    0x0EC0  U58 B1
0x02E0  U11 B1    0x06E0  U27 B1    0x0AE0  U43 B1    0x0EE0  U59 B1
0x0300  U12 B0    0x0700  U28 B0    0x0B00  U44 B0    0x0F00  U60 B0
0x0320  U13 B0    0x0720  U29 B0    0x0B20  U45 B0    0x0F20  U61 B0
0x0340  U12 B1    0x0740  U28 B1    0x0B40  U44 B1    0x0F40  U60 B1
0x0360  U13 B1    0x0760  U29 B1    0x0B60  U45 B1    0x0F60  U61 B1
0x0380  U14 B0    0x0780  U30 B0    0x0B80  U46 B0    0x0F80  U62 B0
0x03A0  U15 B0    0x07A0  U31 B0    0x0BA0  U47 B0    0x0FA0  U63 B0
0x03C0  U14 B1    0x07C0  U30 B1    0x0BC0  U46 B1    0x0FC0  U62 B1
0x03E0  U15 B1    0x07E0  U31 B1    0x0BE0  U47 B1    0x0FE0  U63 B1
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The following table illustrates how the two-bit values are packed into 32-bit words. 
Spread Factor 4 channels require more storage space than is available in a single channel buffer. 
To allow for SF4 processing, the buffers for an even channel and the next highest odd channel 
are joined together. The even channel performs the processing while the odd channel is
disabled.



Bits   31-30  29-28  27-26  25-24  23-22  21-20  19-18  17-16
Name   b(0)   b(1)   b(2)   b(3)   b(4)   b(5)   b(6)   b(7)

Bits   15-14  13-12  11-10  9-8    7-6    5-4    3-2    1-0
Name   b(8)   b(9)   b(10)  b(11)  b(12)  b(13)  b(14)  b(15)
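
The b lookup table layout and the word packing can be summarized in code. In the sketch below, the offset formula is inferred from the regular pattern of the b buffer table (each even/odd pair of users occupies 0x80 bytes, with both B0 buffers ahead of both B1 buffers, which keeps an even channel's buffer adjacent to its odd partner's for SF4), and the packing places b(0) in bits 31-30 as in the table above. All function names are hypothetical.

```python
def b_buffer_offset(user: int, buffer: int) -> int:
    """Byte offset of b buffer `buffer` (0 or 1) for `user` (0..63)."""
    if not 0 <= user < 64 or buffer not in (0, 1):
        raise ValueError("user must be 0..63 and buffer 0 or 1")
    pair, odd = divmod(user, 2)
    return pair * 0x80 + buffer * 0x40 + odd * 0x20


def pack_b(values: list[int]) -> int:
    """Pack sixteen two-bit b values into one 32-bit word, b(0) first."""
    if len(values) != 16:
        raise ValueError("exactly 16 values per word")
    word = 0
    for i, value in enumerate(values):
        word |= (value & 0b11) << (30 - 2 * i)
    return word


def unpack_b(word: int) -> list[int]:
    """Inverse of pack_b: recover b(0)..b(15) from a 32-bit word."""
    return [(word >> (30 - 2 * i)) & 0b11 for i in range(16)]
```

Since only the LSB of each two-bit value is used by the hardware, a caller could mask each unpacked value with 1 to obtain the bit that matters.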



The beta*a-hat table contains the amplitude estimates for each finger pre-multiplied by 
the value of Beta. The following table shows the memory mappings for each channel. 



Offset  User    Offset  User    Offset  User    Offset  User
0x0000  0       0x0800  16      0x1000  32      0x1800  48
0x0080  1       0x0880  17      0x1080  33      0x1880  49
0x0100  2       0x0900  18      0x1100  34      0x1900  50
0x0180  3       0x0980  19      0x1180  35      0x1980  51
0x0200  4       0x0A00  20      0x1200  36      0x1A00  52
0x0280  5       0x0A80  21      0x1280  37      0x1A80  53
0x0300  6       0x0B00  22      0x1300  38      0x1B00  54
0x0380  7       0x0B80  23      0x1380  39      0x1B80  55
0x0400  8       0x0C00  24      0x1400  40      0x1C00  56
0x0480  9       0x0C80  25      0x1480  41      0x1C80  57
0x0500  10      0x0D00  26      0x1500  42      0x1D00  58
0x0580  11      0x0D80  27      0x1580  43      0x1D80  59
0x0600  12      0x0E00  28      0x1600  44      0x1E00  60
0x0680  13      0x0E80  29      0x1680  45      0x1E80  61
0x0700  14      0x0F00  30      0x1700  46      0x1F00  62
0x0780  15      0x0F80  31      0x1780  47      0x1F80  63

The following table shows how the buffers are distributed within each channel:

Offset  User Buffer
0x00    0
0x20    1
0x40    2
0x60    3

The following table shows a memory mapping for the individual fingers of each antenna:

Offset  Finger  Antenna
0x00    0       0
0x04    1       0
0x08    2       0
0x0C    3       0
0x10    0       1
0x14    1       1
0x18    2       1
0x1C    3       1
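
Taken together, the three tables for the beta*a-hat memory suggest a simple nested layout. The sketch below computes an entry's byte offset under two assumptions that should be checked against the hardware: that the four buffers of a channel are evenly spaced 0x20 bytes apart (which is what the 0x80 channel spacing and the 0x20-byte finger/antenna block imply), and that each entry occupies one 32-bit word. The function name is hypothetical.

```python
CHANNEL_STRIDE = 0x80  # per-user spacing in the beta*a-hat table
BUFFER_STRIDE = 0x20   # assumed uniform spacing of the four buffers
ANTENNA_STRIDE = 0x10  # four fingers of one antenna
FINGER_STRIDE = 0x04   # assumed one 32-bit word per finger


def beta_a_hat_offset(user: int, buffer: int, antenna: int, finger: int) -> int:
    """Byte offset of the beta*a-hat entry for the given indices."""
    if not (0 <= user < 64 and 0 <= buffer < 4
            and 0 <= antenna < 2 and 0 <= finger < 4):
        raise ValueError("index out of range")
    return (user * CHANNEL_STRIDE
            + buffer * BUFFER_STRIDE
            + antenna * ANTENNA_STRIDE
            + finger * FINGER_STRIDE)
```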





The y (soft decisions) table contains two buffers for each channel. Like the b lookup 
table, an even and odd channel are bonded together to process SF4. Each y data value is stored 
as a byte. The data is written into the buffers as packed 32-bit words. 



Offset  Buffer    Offset  Buffer    Offset  Buffer    Offset  Buffer
0x0000  U0 B0     0x4000  U16 B0    0x8000  U32 B0    0xC000  U48 B0
0x0200  U1 B0     0x4200  U17 B0    0x8200  U33 B0    0xC200  U49 B0
0x0400  U2 B1     0x4400  U18 B1    0x8400  U34 B1    0xC400  U50 B1
0x0600  U3 B1     0x4600  U19 B1    0x8600  U35 B1    0xC600  U51 B1
0x0800  U0 B0     0x4800  U16 B0    0x8800  U32 B0    0xC800  U48 B0
0x0A00  U1 B0     0x4A00  U17 B0    0x8A00  U33 B0    0xCA00  U49 B0
0x0C00  U2 B1     0x4C00  U18 B1    0x8C00  U34 B1    0xCC00  U50 B1
0x0E00  U3 B1     0x4E00  U19 B1    0x8E00  U35 B1    0xCE00  U51 B1
0x1000  U4 B0     0x5000  U20 B0    0x9000  U36 B0    0xD000  U52 B0
0x1200  U5 B0     0x5200  U21 B0    0x9200  U37 B0    0xD200  U53 B0
0x1400  U6 B1     0x5400  U22 B1    0x9400  U38 B1    0xD400  U54 B1
0x1600  U7 B1     0x5600  U23 B1    0x9600  U39 B1    0xD600  U55 B1
0x1800  U4 B0     0x5800  U20 B0    0x9800  U36 B0    0xD800  U52 B0
0x1A00  U5 B0     0x5A00  U21 B0    0x9A00  U37 B0    0xDA00  U53 B0
0x1C00  U6 B1     0x5C00  U22 B1    0x9C00  U38 B1    0xDC00  U54 B1
0x1E00  U7 B1     0x5E00  U23 B1    0x9E00  U39 B1    0xDE00  U55 B1
0x2000  U8 B0     0x6000  U24 B0    0xA000  U40 B0    0xE000  U56 B0
0x2200  U9 B0     0x6200  U25 B0    0xA200  U41 B0    0xE200  U57 B0
0x2400  U10 B1    0x6400  U26 B1    0xA400  U42 B1    0xE400  U58 B1
0x2600  U11 B1    0x6600  U27 B1    0xA600  U43 B1    0xE600  U59 B1
0x2800  U8 B0     0x6800  U24 B0    0xA800  U40 B0    0xE800  U56 B0
0x2A00  U9 B0     0x6A00  U25 B0    0xAA00  U41 B0    0xEA00  U57 B0
0x2C00  U10 B1    0x6C00  U26 B1    0xAC00  U42 B1    0xEC00  U58 B1
0x2E00  U11 B1    0x6E00  U27 B1    0xAE00  U43 B1    0xEE00  U59 B1
0x3000  U12 B0    0x7000  U28 B0    0xB000  U44 B0    0xF000  U60 B0
0x3200  U13 B0    0x7200  U29 B0    0xB200  U45 B0    0xF200  U61 B0
0x3400  U14 B1    0x7400  U30 B1    0xB400  U46 B1    0xF400  U62 B1
0x3600  U15 B1    0x7600  U31 B1    0xB600  U47 B1    0xF600  U63 B1
0x3800  U12 B0    0x7800  U28 B0    0xB800  U44 B0    0xF800  U60 B0
0x3A00  U13 B0    0x7A00  U29 B0    0xBA00  U45 B0    0xFA00  U61 B0
0x3C00  U14 B1    0x7C00  U30 B1    0xBC00  U46 B1    0xFC00  U62 B1
0x3E00  U15 B1    0x7E00  U31 B1    0xBE00  U47 B1    0xFE00  U63 B1
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The sum of the a-hat squares is stored as a 16-bit value. The following table contains a 
memory address mapping for each channel. 



Offset  User    Offset  User    Offset  User    Offset  User
0x0000  0       0x0200  16      0x0400  32      0x0600  48
0x0020  1       0x0220  17      0x0420  33      0x0620  49
0x0040  2       0x0240  18      0x0440  34      0x0640  50
0x0060  3       0x0260  19      0x0460  35      0x0660  51
0x0080  4       0x0280  20      0x0480  36      0x0680  52
0x00A0  5       0x02A0  21      0x04A0  37      0x06A0  53
0x00C0  6       0x02C0  22      0x04C0  38      0x06C0  54
0x00E0  7       0x02E0  23      0x04E0  39      0x06E0  55
0x0100  8       0x0300  24      0x0500  40      0x0700  56
0x0120  9       0x0320  25      0x0520  41      0x0720  57
0x0140  10      0x0340  26      0x0540  42      0x0740  58
0x0160  11      0x0360  27      0x0560  43      0x0760  59
0x0180  12      0x0380  28      0x0580  44      0x0780  60
0x01A0  13      0x03A0  29      0x05A0  45      0x07A0  61
0x01C0  14      0x03C0  30      0x05C0  46      0x07C0  62
0x01E0  15      0x03E0  31      0x05E0  47      0x07E0  63



Within each buffer, the value for antenna 0 is stored at address offset 0x00 and the value
for antenna 1 is stored at address offset 0x04. The following table shows the buffer mapping:

Offset  User Buffer
0x00    0
0x08    1
0x10    2
0x18    3



Each channel is provided a crossbar (e.g., RACEway™) route on the bus, and a base 
address for buffering output on a slot basis. Registers for controlling buffers are allocated as 
shown in the following two tables. External devices are blocked from writing to register 
addresses marked as reserved. 



Offset  User    Offset  User    Offset  User    Offset  User
0x0000  0       0x0200  16      0x0400  32      0x0600  48
0x0020  1       0x0220  17      0x0420  33      0x0620  49
0x0040  2       0x0240  18      0x0440  34      0x0640  50
0x0060  3       0x0260  19      0x0460  35      0x0660  51
0x0080  4       0x0280  20      0x0480  36      0x0680  52
0x00A0  5       0x02A0  21      0x04A0  37      0x06A0  53
0x00C0  6       0x02C0  22      0x04C0  38      0x06C0  54
0x00E0  7       0x02E0  23      0x04E0  39      0x06E0  55
0x0100  8       0x0300  24      0x0500  40      0x0700  56
0x0120  9       0x0320  25      0x0520  41      0x0720  57
0x0140  10      0x0340  26      0x0540  42      0x0740  58
0x0160  11      0x0360  27      0x0560  43      0x0760  59
0x0180  12      0x0380  28      0x0580  44      0x0780  60
0x01A0  13      0x03A0  29      0x05A0  45      0x07A0  61
0x01C0  14      0x03C0  30      0x05C0  46      0x07C0  62
0x01E0  15      0x03E0  31      0x05E0  47      0x07E0  63

Offset  Entry
0x0000  Route to Channel Destination
0x0004  Base Address for Buffers
0x0008  Buffers
0x000C  RESERVED
0x0010  RESERVED
0x0014  RESERVED
0x0018  RESERVED
0x001C  RESERVED



Slot buffer size is automatically determined by the channel spread factor. Buffers are
used in round-robin fashion and all buffers for a channel must be arranged contiguously. The
Buffers control register determines how many buffers are allocated for each channel. A setting
of 0 indicates one available buffer, a setting of 1 indicates two available buffers, and so on.
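
The round-robin use of contiguous output buffers can be sketched as follows. The text does not give the slot buffer size for each spread factor, so it is passed in as a parameter here; slot_size, slot_index and the function names are hypothetical and serve only to illustrate the arithmetic.

```python
def buffer_count(setting: int) -> int:
    """Buffers register value to number of buffers: 0 -> 1, 1 -> 2, ..."""
    if setting < 0:
        raise ValueError("setting must be non-negative")
    return setting + 1


def slot_buffer_address(base: int, slot_size: int,
                        setting: int, slot_index: int) -> int:
    """Address of the output buffer used for the given slot.

    Buffers are contiguous from `base` and reused in round-robin order.
    """
    return base + (slot_index % buffer_count(setting)) * slot_size
```

With a setting of 2 (three buffers), slots 0, 1, 2, 3 would land in buffers 0, 1, 2 and then 0 again.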

A further understanding of the operation of the illustrated and other embodiments of
the invention may be attained by reference to (i) US Provisional Application Serial No.
60/275,846 filed March 14, 2001, entitled "Improved Wireless Communications Systems and
Methods"; (ii) US Provisional Application Serial No. 60/289,600 filed May 7, 2001, entitled
"Improved Wireless Communications Systems and Methods Using Long-Code Multi-User
Detection"; and (iii) US Provisional Application Serial No. 60/295,060 filed June 1, 2001,
entitled "Improved Wireless Communications Systems and Methods for a Communications
Computer," the teachings of all of which are incorporated herein by reference, and a copy
of the latter of which may be filed herewith.

The above embodiments are presented for illustrative purposes only. Those skilled in
the art will appreciate that various modifications can be made to these embodiments without
departing from the scope of the present invention. For example, multiple summations can be
utilized by a system of the invention, and not separate summations as described herein.
Moreover, by way of further non-limiting example, it will be appreciated that although the
terminology used above is largely based on the UMTS CDMA protocols, the methods and
apparatus described herein are equally applicable to DS/CDMA, CDMA2000 1X, CDMA2000
1xEV-DO, and other forms of CDMA.

Therefore, in view of the foregoing, what we claim is: 

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Background of the Invention 

The invention pertains to wireless communications and, more particularly, to 
communications computers. The invention has application, by way of non-limiting example, in 
improving the capacity of cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 

A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g., 
multiple cellular phone users in the same geographic area using their phones at the same time. 
This is referred to as multiple access interference (MAI). It has the effect of limiting the
capacity of cellular phone base stations, since interference may exceed acceptable levels
(driving service quality below acceptable levels) when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al, "Multiuser Detection for 
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 
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An object of this invention is to provide improved methods and apparatus for wireless 
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 

A still further object of the invention is to provide such methods and apparatus as manage
faults for high-availability. 
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Summary of the Invention 

These and other objects are met by the invention which provides, in one aspect, a 
communications computer, referred to as the "MCW-1" (among other terms) in the materials that 
follow, and methods of operation thereof. An overview of that system is provided in the section 
entitled "Communications Computer," beginning on page 5 hereof. A more complete 
understanding of its implementation may be attained by reference to the other attached materials. 

In view of those materials, aspects of the invention include, but are not limited to the 
following: 

• architecture and operation of a communications computer for a wireless 
communications system, including a fully programmable computer inserted 
into base transceiver station (BTS) to support compute-intensive and/or highly 
data-dependent functions such as adaptive processing and interference 
cancellation 

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 
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Detailed Description of the Invention 

See the attached materials on pages 5-11 hereof, providing description and block 
diagram of a preferred structure and operation of a communications computer for wireless 
applications according to the invention. 

The aforementioned materials pertain to improvements on the methods and apparatus 
described in United States Provisional Application Serial No. 60/275,846, filed March 14, 2001, 
entitled IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS and 
United States Provisional Application Serial No. 60/289,600, filed May 7, 2001, entitled 
IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS USING LONG- 
CODE MULTI-USER DETECTION, the teachings of both of which are incorporated herein by 
reference and copies of at least portions of which are attached hereto. Those copies bear the
U.S. Postal Service Express Mail label number of both prior filings, as well as that of this filing
(the latter being referred to as the "New Exp. Mail Label No.").
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Background of the Invention 

The invention pertains to wireless communications and, more particularly, to methods 
and apparatus for interference cancellation in code-division multiple access communications. 
The invention has application, by way of non-limiting example, in improving the capacity of 
cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 

A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g., 
multiple cellular phone users in the same geographic area using their phones at the same time. 
This is referred to as multiple access interference (MAI). It has the effect of limiting the
capacity of cellular phone base stations, since interference may exceed acceptable levels
(driving service quality below acceptable levels) when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al, "Multiuser Detection for 
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 

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An object of this invention is to provide improved methods and apparatus for wireless 
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 

A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 
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Summary of the Invention 

These and other objects are met by the invention which provides, in one aspect, a 
wireless communications system, referred to as the "MCW-1" (among other terms) in the
materials that follow, and methods of operation thereof. An overview of that system is provided 
in the document entitled "Software Architecture of the MCW-1 MUD Board," immediately 
following this Summary. A more complete understanding of its implementation may be attained 
by reference to the other attached materials. 

In view of those materials, aspects of the invention include, but are not limited to, the 
following: 

• methods and apparatus for long-code multi-user detection (MUD) in a wireless
communications system. 

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 
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Detailed Description of the Invention 

See the attached materials on pages 5-12 hereof, providing a block diagram of a 
preferred algorithm for long code MUD which includes identification of (roughly) how many 
GOPS are involved in each major function; a diagram showing interfaces between a long code 
MUD processing card according to the invention and a modem, e.g., of the type provided by 
Motorola (or another supplier of such components); and two block diagrams of the same 
BASELINE 0 board hardware architecture at a top level identifying the processing nodes. The 
attached diagram entitled "Long-code Mapping to Hardware" illustrates support of 64 users for 
long code MUD and shows parts of the long code MUD algorithm supported by each processing 
node. The diagram entitled "Short-code Mapping to Hardware" illustrates support of 128 users 
for short code MUD and shows parts of the short code MUD algorithm that would be supported by
each processing node. 

The aforementioned materials pertain to improvements on the methods and apparatus 
described in United States Provisional Application Serial No. 60/275,846, filed March 14, 2001, 
entitled IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS, the 
teachings of which are incorporated herein by reference and a copy of which is attached hereto. 
That copy bears the U.S. Postal Service Express Mail label number of both the original filing, as 
well as that of this filing (the latter being referred to as the "New Exp. Mail Label No."). 
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Background of the Invention 

The invention pertains to wireless communications and, more particularly, to methods 
and apparatus for interference cancellation in code-division multiple access communications. 
The invention has application, by way of non-limiting example, in improving the capacity of 
cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 

A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g., 
multiple cellular phone users in the same geographic area using their phones at the same time. 
This is referred to as multiple access interference (MAI). It has the effect of limiting the
capacity of cellular phone base stations, since interference may exceed acceptable levels
(driving service quality below acceptable levels) when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al, "Multiuser Detection for 
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 

An object of this invention is to provide improved methods and apparatus for wireless 
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 

A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 

Summary of the Invention 

These and other objects are met by the invention which provides, in one aspect, a 
wireless communications system, referred to as the "MCW-1" (among other terms) in the 
materials that follow, and methods of operation thereof. An overview of that system is provided 
in the document entitled "Software Architecture of the MCW-1 MUD Board," immediately 
following this Summary. A more complete understanding of its implementation may be attained 
by reference to the other attached materials. 

In view of those materials, aspects of the invention include, but are not limited to, the 
following: 

• hardware and/or software architectures (and methods of operation thereof) for 
multi-user detection in wireless communications systems and particularly, for 
example, in a wireless communications base station; 

• a hardware architecture (and methods of operation thereof) for multi-user 
detection in wireless communications systems pairing each processing node with 
NVRAM and watchdog PLD for fault management; 

• methods and apparatus for connecting watchdog PLDs with an out-of-band fault- 
management bus; 

• methods and apparatus for use of an embedded host with the RACEway™ 
architecture of Mercury Computer Systems, Inc.; 

• methods and apparatus for interfacing a digital signal processor to the 
RACEway™ architecture; 

• methods and apparatus for interfacing the RACEway™ architecture to a 
programming port in a device for multi-user detection in wireless communications 
systems; 

• methods and apparatus for implementing a DMA Engine FPGA for use in multi- 
user detection in a wireless communications system; 

• methods and apparatus for implementing a hardware-based reset voter and stop 
voter; 

• methods and apparatus for scalable mapping of handset and BTS functions to 
multiple processors; 

• methods and apparatus for facilitating allocation and management of buffers for 
interconnecting processors that implement the aforementioned mapping; 

• methods and apparatus for implementing a hybrid operating system, e.g., with the 
VxWorks operating system (of Wind River Systems, Inc.) on a host computer and 
the MC/OS operating system on RACE®-based nodes (RACE and MC/OS are 
trademarks of Mercury Computer Systems, Inc.); 

• methods and apparatus for high-availability multi-user detection in wireless 
communications systems, including (by way of non-limiting example) round- 
robin fault testing and use of NVRAM to store fault symptoms and use of master 
to diagnose faults from NVRAM contents; 

• class library-based methods and apparatus for facilitating interprocessor 
communications, by way of non-limiting example, in buffering for multi-user 
detection in wireless communications systems; 

• methods and apparatus for implementation of R-matrix, gamma-matrix and MPIC 
computations on separate processors in a device for multi-user detection in 
wireless communications systems; 

• methods and apparatus for computing complementary R-matrix elements in 
parallel using multiple processors in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for depositing results of R-matrix calculations 
contiguously in memory in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for increasing the number of MPIC and R-matrix 
calculations performed in cache in a device for multi-user detection in wireless 
communications systems; 

• methods and apparatus for performing a gamma-matrix calculation in FPGA in a 
device for multi-user detection in wireless communications systems; 

• methods and apparatus for equalizing load of R-matrix-element calculation 
among multiple processors in a device for multi-user detection in wireless 
communications systems; and 

• methods and apparatus for use of Altivec registers and instruction set in 
performing MUD calculations in a wireless communications system. 

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 

1 Purpose

The purpose of this document is to describe the software architecture of the MCW-1 board.
The MCW-1 application is a digital signal processing application that performs interference
cancellation for a cellular base station modem board.

The software project consists of 3 major parts:

• Support for the custom MCW-1 board being designed by the Wireless Communications Group
hardware department. This consists of porting the existing host (VxWorks) and
multicomputer (MC/OS) software to the board, and adding code to support specialized
features of the board such as LED control, voltage monitoring, hardware watchdogs, etc.

• Increasing the MTBF of the system by addition of high availability software. This
software includes monitoring features such as watchdogs, fault detection/repair
algorithms, and remote software download.

• Implementation of the application software. This includes optimal implementation of the
MUD algorithms, as well as implementing degraded versions of the algorithm that can be
executed when some of the computational hardware is unavailable due to failures.

Detailed information on the design of new software for the MCW-1 board can be found in the
appropriate functional design documents, which are listed in the References section of
this document.

2 Glossary

1. MTBF - Mean Time Between Failures

2. MUD - Multi User Detection. A class of algorithms to detect multiple interference
sources and remove those effects from the signal.

3. Multicomputer - a parallel computer which achieves its increase in performance by
having more than one CPU working on the application simultaneously.

4. VxWorks - a proprietary real time operating system sold by Wind River, Inc.

3 Application Execution Environment

3.1 Overview

The purpose of the MUD application is to input raw antenna data from the base station
modem card, detect sources of interference, produce a new stream of data which has had
interference removed, and then output the data to the modem card for further processing.

Characteristics of this processing are that it must have low latency (< 300 microseconds),
must deal with large amounts of data (> 110 million bytes of data per second), and must be
very reliable.

The Mercury computer system is well suited to this kind of signal processing, exhibiting
both very low latencies and high bandwidths.

The system hardware and software were not designed with high availability as a goal, so
reliability is in line with other standard computer systems designed for commercial
applications.

Input data flows from the Modem Motherboard, over the PCI bus, through the PXB++ bridge,
onto the fabric, through the crossbar, and into the memory of the computing elements.
Output data flows in the opposite direction. Some data will also flow between the 8240
Host CPU and the compute elements, via a similar pathway, i.e. from the PCI bus through
the PXB++ and thus onto the fabric.

Although the software tries to treat the system as if the hardware were symmetric, as can
be seen in the following figure, the host 8240 CPU is attached via the PCI bus, not
directly to the fabric.

Figure 1
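
The latency and throughput bounds stated in this overview can be combined into a rough buffer-sizing estimate. The following is a minimal sketch, not part of the original design: the function name is hypothetical, and the quoted limits (< 300 microseconds, > 110 million bytes per second) are treated as exact values purely for illustration.

```python
def bytes_in_flight(rate_bytes_per_s: int, latency_us: int) -> int:
    """Bytes that arrive during one latency window (integer arithmetic)."""
    return rate_bytes_per_s * latency_us // 1_000_000


# Bounds quoted in the overview, treated here as exact values (an assumption).
RATE = 110_000_000   # bytes per second
LATENCY_US = 300     # microseconds

window = bytes_in_flight(RATE, LATENCY_US)
print(window)  # 33000
```

Under these assumptions roughly 33,000 bytes of antenna data arrive within a single latency window, which gives a sense of the minimum amount of data the pipeline holds at any instant.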

3.2 Operating System

MC/OS was selected as the operating system for the MCW-1 board because it provides the low
latencies and high I/O and IPC bandwidths required for these sorts of algorithms, and also
because it already provides support for most of the hardware being incorporated on the
MCW-1 board.

The MUD application can be kept as portable as possible by minimizing the use of non-POSIX
MC/OS system calls, and encapsulating calls into proprietary MC/OS interfaces such as DX.

MC/OS requires the presence of a host computer system, which in this case will be a
Motorola 8240 PowerPC processor running the VxWorks operating system.

3.3 IPC 



The MC/OS DX subsystem will be used for IPC within the application. This 
API provides low overhead, low latency access to the Mercury DMA engines, 
which in turn provide high bandwidth transfers of data. DX will be used to move 
data between the G4 compute elements during parallel processing, and also will 
be used to move data between the MC/OS compute elements, the VxWorks host 
computer, and the motherboard modem card. 



3.4 I/O 



Input / Output between the MUD card and the motherboard modem card takes 
place by moving data between the Race++ Fabric and the PCI bus via the PXB++ 
bridge. The application will use DX to initialize the PXB++ bridge, and to cause 
input/output data to move as if it were regular DX IPC traffic. 

Discussions with the customer need to take place in order to determine exactly 
how data flows over the PCI bus. For instance, it is currently unclear who will 
initiate data transfers, and how the initiator will know which PCI addresses should 
be involved in the transfer. A number of meetings with the customer are required 
to resolve these issues. 

3.5 High Availability 

The approach to high availability on the MCW-1 card is to do most of the high 
availability processing at a time when the application is not running. Specifically, 
faults are handled by rebooting the system (fairly quickly). When the system 
comes up, the application can determine which processing resources are available, 
and it is up to the application to determine how to map its processing needs onto 
the available resources. 

This approach to high availability means that there are short interruptions in 
service, but that the application does not need to know how to continue execution 
across faults. For instance, the application can make the assumption that the 
hardware configuration will not change without the system first rebooting. 

If the application has state which needs to be preserved across reboots, the 
application is responsible for checkpointing the data on a regular basis. The 
system software will provide an API to a portion of the non-volatile RAM for this 
purpose. It should be noted that the non-volatile RAM is quite small, and that 
storage of more than a few hundred bytes of data will require another mechanism 
to be put in place. 
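
The checkpointing responsibility described above can be illustrated with a small record format. This is a minimal sketch under stated assumptions: the system API to the non-volatile RAM is not described here, so a `bytearray` stands in for the NVRAM region, and the region size, magic value, record layout, and function names are all hypothetical.

```python
import struct
import zlib

NVRAM_SIZE = 256          # hypothetical size of the region given to the application
MAGIC = b"CKPT"           # hypothetical marker identifying a valid record
HEADER = "<4sHI"          # magic, payload length, CRC32 of payload


def save_checkpoint(nvram: bytearray, payload: bytes) -> None:
    """Write magic + length + CRC32 + payload at the start of the region."""
    record = struct.pack(HEADER, MAGIC, len(payload), zlib.crc32(payload)) + payload
    if len(record) > len(nvram):
        raise ValueError("checkpoint does not fit in the NVRAM region")
    nvram[:len(record)] = record


def load_checkpoint(nvram: bytearray):
    """Return the stored payload, or None if no valid record is present."""
    magic, length, crc = struct.unpack_from(HEADER, nvram, 0)
    start = struct.calcsize(HEADER)
    payload = bytes(nvram[start:start + length])
    if magic != MAGIC or len(payload) != length or zlib.crc32(payload) != crc:
        return None
    return payload
```

After a reboot, the application would call `load_checkpoint` first and fall back to default state when it returns `None`. The size check in `save_checkpoint` reflects the constraint that only a few hundred bytes are available.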

4 Operating System Environment 

4.1 Overview 

Mercury Computer Systems, Inc. has historically had the concept of a host 
computer system. This dates back to the days when Mercury produced array 
processors that were attached to customers' mainframe computers. The evolution 
of Mercury multicomputers has left a vestigial host that often performs little more 
service than as a bootstrap device for the multicomputer. 

The host computer system survives in the MCW-1 design primarily as a way to 
reduce schedule risk. The existence of a host computer system is assumed in so 
many ways by the existing Mercury software, that it would add significant 
schedule risk to attempt to remove this assumption in the MCW-1 timeframe. 

In the MCW-1 board, the host system performs the following functions: 

• It configures the Compute Elements, Fabric, and Bridges 

• It loads executable code into the Compute Elements 

• It serves as a bridge to the TCP/IP internetwork 

• It serves as a file system daemon 

• It runs some of the application software 

• It manages some of the specialized high availability hardware 

4.2 Bootstrap 

The host computer system is based on a Motorola 8240 PowerPC processor on 
the MCW-1 board. The 8240 is attached to an amount of linear flash memory. 
This flash memory serves several purposes. 

The first purpose the flash memory serves is as a source of instructions to execute when
the 8240 comes out of reset. Linear flash is flash which can be addressed as if it were
normal RAM. Flash memories can also be organized to look like disk controllers; however,
in that configuration they require a disk driver to provide access to the flash memory.
Although such an organization has several benefits, such as automatic reallocation of bad
flash cells and write wear leveling, it is not appropriate for initial bootstrap.

The flash memory also serves as a file system for the host (see Section 4.6), and as a
place to store board permanent information (such as a serial number). Refer to the
function design specification (TBS) for more details on how flash memory is used.

When the 8240 first comes out of reset, memory is not turned on. Since high level
languages such as C assume some memory is present (for a stack, for instance), the initial
bootstrap code must be coded in assembler. This assembler bootstrap should only be a few
hundred lines of code, sufficient to configure the memory controller, initialize memory,
and initialize the configuration of the 8240 internal registers.

After the assembler bootstrap has finished execution, control is passed to the MCW-1 H.A.
code (which is also contained in boot flash memory). The purpose of the H.A. code is to
attempt to configure the fabric, and load the compute element CPUs with H.A. code. Once
this is complete, all the processors participate in the H.A. algorithm. The output of the
algorithm is a configuration table which details which hardware is operational and which
hardware is not. This is an input to the next stage of bootstrap, the Multicomputer
Configuration.

4.3 Multicomputer Configuration

MC/OS expects the host computer system to configure the multicomputer. The configmc
program reads a textual description of the computer system configuration, and produces a
series of binary data structures that describe the computer system configuration. These
data structures are used in MC/OS to describe the routing and configuration of the
multicomputer.

The MCW-1 board will use almost exactly the same sequence to configure the multicomputer.
The major difference is that MC/OS expects configurations to be totally static, whereas
the MCW-1 configuration will need to change dynamically as faulty hardware causes various
resources to be unavailable for use.

There are currently two proposals being considered for how this dynamic reconfiguration
takes place.

The first proposal is that the binary data structures produced by configmc are modified to
include flags that indicate whether a piece of hardware is usable or not. A modification
to MC/OS would prevent it from using hardware marked as broken. The risk here is that the
modifications to MC/OS may be non-trivial. The benefit may be faster reboot times.

The second proposal is that the output of the H.A. algorithm is used to produce a new
configuration file input to configmc, the configmc execution is repeated with the new
file, and MC/OS is configured and loaded with no knowledge of the broken hardware
whatsoever. This proposal has the added benefit that configmc may be able to calculate
optimal routing tables in the face of failed hardware, minimizing the performance impact
of the failure on the remaining components. This proposal provides risk reduction given
that MC/OS changes would not be required.
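
The second proposal amounts to filtering the textual configuration before it is handed to configmc again. The following is a minimal sketch under stated assumptions: the actual configmc input format is not described here, so the one-component-per-line layout, the component names (`ce0` to `ce3`), and the function name `filter_config` are all hypothetical.

```python
def filter_config(config_text: str, faulty: set) -> str:
    """Return a copy of the configuration with faulty components omitted.

    Assumes (hypothetically) one component per line, with the component
    name as the first whitespace-separated token.
    """
    kept = []
    for line in config_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in faulty:
            continue  # omit broken hardware so MC/OS never learns of it
        kept.append(line)
    return "\n".join(kept) + "\n"


original = "ce0 port=0\nce1 port=1\nce2 port=2\nce3 port=3\n"
print(filter_config(original, {"ce2"}), end="")
```

In this sketch the set of faulty components would come from the configuration table produced by the H.A. algorithm, and the filtered text would be the new input file for the repeated configmc run.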
4.4 Multicomputer Loading

After the host computer has configured the multicomputer, the runmc program loads the
functional compute elements with a copy of MC/OS. The only change required for the MCW-1
board is for the loading process to examine which hardware may be offline because it is
faulty, and take this into account when determining which compute elements need to be
loaded.

4.5 TCP/IP Bridge

We believe that the customer is likely to require access to the MCW-1 board from a TCP/IP
network. MC/OS nodes do not contain a TCP/IP stack; therefore the host computer system
acts as a connection to the TCP/IP network. The VxWorks operating system contains a fully
functional TCP/IP stack. All currently envisioned daemons that need access to the TCP/IP
network will run on the host processor. Should the need arise for compute elements to
access network resources, the host computer would have to act as a proxy, exchanging
information with the compute element utilizing DX transfers, and then making the
appropriate TCP/IP calls on behalf of the compute element.

4.6 File System

The host computer system needs a file system to store configuration files, executable
programs, and MC/OS images. Rotating disks have insufficient MTBF; therefore flash memory
will be utilized. Rather than have a separate flash memory from the host computer boot
flash, the same flash is utilized for both bootstrap purposes and for holding file system
data. A commercial flash file system will be purchased and ported which provides DOS file
system semantics as well as write wear leveling. Wear leveling attempts to spread the
number of writes evenly across the sectors of flash memory, as flash memory can only be
written a finite number of times before it is worn out. Modern flash devices can be
written around 100,000 times before they are worn out.
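
The idea behind wear leveling can be shown with a deliberately simplified allocator that always writes to the least-worn sector. This is a minimal sketch, not the behavior of any particular commercial flash file system (which would also remap logical blocks and handle bad cells); the class name, sector count, and method names are hypothetical, and the endurance default simply reuses the approximate 100,000-write figure from the text.

```python
class WearLeveler:
    """Tracks per-sector write counts and always picks the least-worn sector."""

    def __init__(self, sectors: int, endurance: int = 100_000):
        self.write_counts = [0] * sectors
        self.endurance = endurance

    def pick_sector(self) -> int:
        """Choose the least-worn sector for the next write and record the write."""
        sector = min(range(len(self.write_counts)),
                     key=self.write_counts.__getitem__)
        if self.write_counts[sector] >= self.endurance:
            raise RuntimeError("all sectors have reached their endurance limit")
        self.write_counts[sector] += 1
        return sector


leveler = WearLeveler(sectors=4)
picks = [leveler.pick_sector() for _ in range(8)]
print(picks)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Because the least-worn sector is always chosen, the write counts of any two sectors never differ by more than one, which is the even spreading that the text describes.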

4.7 Remote Software Upgrade

The current design of the MCW-1 board assumes that the customer will want to update system
and application code in the field, via network. There are two portions of code which need
to be updated: the bootstrap code which is executed by the 8240 processor when it comes
out of reset, and the rest of the code which resides on the flash file system as files.

When code is initially downloaded to the MCW-1, it is written as a group of files within a
directory in the flash file system. A single top level file keeps track of which directory
tree is used to boot the system. This file continues to point at the existing directory
tree until a download of new software is successfully completed. When a download has been
completed and verified, the top-level file is updated to point to the new directory tree,
the boot flash is rewritten, and the system can be rebooted.
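
The download-verify-switch sequence described above can be sketched on an ordinary file system. This is a minimal sketch under stated assumptions: the pointer file name `current`, the use of SHA-256 digests for verification, and the function name are hypothetical, and the boot flash rewrite step is not modeled.

```python
import hashlib
import os
import tempfile


def activate_release(root: str, new_dir: str, expected_sha256: dict) -> bool:
    """Verify every file in new_dir, then repoint the top-level file to it.

    Returns False and leaves the pointer untouched if any file is missing
    or its digest does not match.
    """
    for name, digest in expected_sha256.items():
        path = os.path.join(root, new_dir, name)
        try:
            with open(path, "rb") as f:
                actual = hashlib.sha256(f.read()).hexdigest()
        except OSError:
            return False
        if actual != digest:
            return False  # keep booting from the existing directory tree

    pointer = os.path.join(root, "current")  # hypothetical top-level file
    fd, tmp = tempfile.mkstemp(dir=root)
    with os.fdopen(fd, "w") as f:
        f.write(new_dir)
    os.replace(tmp, pointer)  # swap in the new pointer in one rename step
    return True
```

Writing the new pointer to a temporary file and renaming it over the old one means a failure partway through leaves the old pointer intact, which matches the requirement that the top-level file keeps pointing at the existing tree until the download is verified.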

A possible problem in multi-board systems is how to deal with different versions of
released software on different boards. For instance, if board 1 has revision 1.0 of the
software distribution, and board 2 has revision 1.1 of the software distribution, will the
two versions work together, or will there be a way to ensure that the same version of
software is installed on all boards? This issue does not occur on the MCW-1 because it is
a single board solution; therefore this issue can be addressed at a later time.

A commercial solution to remote software upgrade is available, and has been ported to
VxWorks. It is our intent to port this code at a future date.

5 High Availability

5.1 Goals

The goal of the high availability features of the MCW-1 is to increase the MTBF of the
system as much as possible with little or no increase in cost to the board. The
requirement for minimal cost increase rules out such common approaches as hot or cold
standby, replicated hardware, etc.

It is not a goal to provide uninterrupted computing during hardware or software failures,
nor is it a goal to provide fault tolerance.

5.2 Fault Detection & Isolation

Fault detection is performed by having each CPU in the system gather as much information
as possible about what it observed during a fault, and then comparing the information in
order to detect which components could be the common cause of the symptoms. In some cases,
it may take multiple faults before the algorithm can detect which component is at fault.
The requirement not to add expensive hardware for fault detection means that in many cases
the algorithm will not be able to determine which component is at fault.

The MCW-1 board has many single points of failure. Specifically, everything on the board
is a single point of failure except for the compute elements. This means that the only
hard failures that can be configured out are failures in the compute elements. However,
many failures are transient or soft, and these can be recovered from with a reboot cycle.
Therefore, we expect the high availability features to have a positive effect on the MTBF
of the card.

More detailed information is available in the functional design specification (1).
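
The comparison step in fault isolation can be pictured as intersecting the suspect lists reported by each CPU. This is a minimal sketch and not the algorithm of the referenced functional design specification: the idea that each CPU reports a set of suspect component names, the names themselves, and the function name are all hypothetical.

```python
def isolate_fault(observations: dict) -> set:
    """Return the components that every reporting CPU lists as a suspect.

    observations maps a CPU name to the set of components that could
    explain the symptoms that CPU observed.
    """
    suspects = None
    for seen in observations.values():
        suspects = set(seen) if suspects is None else suspects & set(seen)
    return suspects or set()


reports = {
    "ce0": {"xbar", "ce2"},
    "ce1": {"ce2", "pxb"},
    "host": {"ce2", "xbar"},
}
print(isolate_fault(reports))  # {'ce2'}
```

When the intersection still contains more than one component, the result is ambiguous; intersecting it with the result from a later fault is one way to picture why multiple faults may be needed before a single component can be identified.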

5.3 Degraded Application

In the case of hard failures of a compute element, the application will have to execute
with reduced demand for computing resources. There are several strategies possible for the
MUD algorithm to decrease computing demands, such as working with a smaller number of
interference sources, or performing a less complete job of interference cancellation.

We expect the computing requirements of the algorithm to be high enough that failure of
more than a single compute element will cause the board to be inoperative. Therefore, the
MCW-1 application only needs to handle two configurations: all compute elements functional
and 1 compute element unavailable. We believe that a small amount of startup code can map
the application onto the two possible configurations. Note that the single crossbar means
that there are no issues as to which processes need to go on which processors - the
bandwidth and latencies for any node to any other node are identical on the MCW-1. This
will not be true of larger systems in the future, and we will eventually need a way to map
computing and I/O requirements onto arbitrary hardware configurations.
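
The small amount of startup code mentioned above could look like the following. This is a minimal sketch under stated assumptions: the mode names and function name are hypothetical, and the total of four compute elements is taken from the hardware description of the board rather than from this section.

```python
TOTAL_CES = 4  # four compute nodes, per the board's hardware description


def select_mode(functional_ces: int) -> str:
    """Map the number of working compute elements to an application mode."""
    if not 0 <= functional_ces <= TOTAL_CES:
        raise ValueError("functional_ces out of range")
    if functional_ces == TOTAL_CES:
        return "full"          # all compute elements functional
    if functional_ces == TOTAL_CES - 1:
        return "degraded"      # exactly one compute element unavailable
    return "inoperative"       # more than one failure: board cannot run MUD


for count in (4, 3, 2):
    print(count, select_mode(count))
```

Because any node reaches any other node with identical bandwidth and latency through the single crossbar, the sketch only needs the count of working compute elements, not their identities.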

5.4 Remote Software Upgrade

Downtime due to the updating of software is counted against the availability of a computer
system, and therefore a remote reload of software is a necessity. The MCW-1 is capable of
downloading new software during normal operation. The reboot strategy means that the
downtime due to starting up new software is only a few seconds.

Referenced Documents

1. "MC/OS High Availability Functional Design Specification", Yevgeniy Tarashchanskiy,
17 April, 2000.

1 REVISION HISTORY 



Revision 0.0 - 3/17/00 Steven Imperiali 
Initial Entry 

Revision 0.01 - 4/25/00 Steven Imperiali 
Minor corrections, filled in missing sections. 

Revision 0.1 - 5/5/00 Steven Imperiali 
Incorporated review comments. 

Revision 0.2 - 5/8/00 Steven Imperiali 
Incorporated review comments. 
Removed reference to RapidIO/RACE++ bridge 

Revision 0.21 - 5/16/00 Steven Imperiali 
Incorporated review comments. 
Modified MPC8240 memory map 

Revision 0.22 - 5/26/00 Steven Imperiali 
Modified MPC8240 Memory Map 

Revision 1.00 - 7/24/00 Steven Imperiali 
Modified MPC8240 Memory Map 
Updated memo with current design status 

Revision 2.01 - 11/01/00 Steven Imperiali 
Modified power supply ramp requirements 

Revision 2.02 - 11/15/00 Steven Imperiali 
Modified interrupt controller 

Revision 2.03 - 1/26/01 Steven Imperiali 
Minor documentation corrections 

Revision 3.00 - 1/31/01 Steven Imperiali 
Modified memo to reflect MCW-1a modules 



2 REFERENCE DOCUMENTS 

1. American National Standard for RACEway Interlink (ANSI/VITA 5-1994) 

2. PCI Rev 2.2 Local Bus Specification 

3. PCE133 ASIC Hardware Specification 

4. XBAR++ Function Specification 

5. PXB++ PCI Bridge Functional Specification 

6. PowerPC 7400 PPC Microprocessor Hardware Specification 

7. Flash Memory Specification p/n TBD 

8. MCW-1 Product Definition Document (PDD) vTBD 

9. Technical brief of Mercury Computer Systems RACE++ series topologies 

10. MPC8240 Users Manual (MPC8240UM/D 07/1999 Rev. 0) 

3 MERCURY PART NUMBER 

The board identifier name is MCW-1a and the Mercury part number is 560549. 

4 FUNCTIONAL DESCRIPTION 

4.1 OVERVIEW 

The MCW-1a is designed to be an algorithm processing daughter card utilizing the MPC7400 PPC, MPC8240, 
PCE133 ASIC, XBAR++ ASIC, and PXB++ FPGA. The MCW-1 mates with a Motorola base station modem board. 
The MCW-1a can provide additional connectivity between processing elements in different sector slots utilizing 
over-the-top RACEway++ cables. It is a Motorola form factor card with four computational nodes and one host node. 
The computational nodes (CNs) are based on the latest MPC7400 PPC microprocessor and the host is an MPC8240. The 
MCW-1 can provide one Ethernet 10/100 BT port on the front panel. A 32-bit, 66 MHz PCI interface provides the 
interface to the Motorola board. 

The MCW-1a block diagram is shown in Figure 1. 

Figure 1. MCW-1a BLOCK DIAGRAM 

Figure 2 shows the MCW-1a system topology. Table 1 gives the proposed route codes for the board. 

Figure 2. MCW-1a BOARD-LEVEL TOPOLOGY (figure label: Dual Over-the-Top) 

Table 1. Route Codes for MCW-1a Board XBAR 
(columns: Route Code; Destination for Virtual Ports; Physical XBAR 1 Ports) 

4.2 FEATURES 

• Custom size daughter card 

• Master PCI 32-bit @ 66MHz compliant with REV2.2 PCI local bus spec. 
PCI write peak performance is 240 MB/sec. 

PCI read peak performance is 220 MB/sec. 

• Single IEEE802.3 compliant Ethernet 10BASE-T/100BASE-T 

• Four computation nodes (CNs) based on MPC7400 PPC running @ 400 MHz. 
1 MB L2 cache per CN @ 200 MHz to 266 MHz. 

128 MB SDRAM with ECC per CN @ 133 MHz. 
Hardware based watchdog monitor. 
One PCE133 ASIC per CN. 

• Two, over-the-top, 66 MHz RACEway++ interlink ports 
configured in cable mode. 

• PCI interface 32-bit @ 66 MHz. 

• RACEway++ crossbar to connect nodes. 

• PXB++ 64-bit @ 33 MHz PCI bus. 

• Non-transparent 64-bit/33 MHz to 32-bit/66 MHz PCI bridge. 

• 200MHz PPC8240 PowerPC processor. 
32-bit 33MHz PCI bus. 

300MHz, 64Mbytes SDRAM. 

• Bulk FLASH interface. 
Linear address mode. 
32 banks of 1Mbytes. 

• LEDs. 

• 8Kbytes non-volatile SRAM. 

• Real time clock. 

• Compute node fault isolation control. 

• JTAG test port. 

4.3 CONFIGURATION OPTIONS 

4.3.1 CPU Options 

• MPC7400@400MHz. 

• MPC7410@400MHz. 

4.3.2 SDRAM Options 

• 128 MB SDRAM @ 133 MHz with ECC. 

4.3.3 FLASH Memory Options 

• 16 MB FLASH memory. 

• 32 MB FLASH memory. 

4.3.4 Ethernet Options 

• No Ethernet. 

• Single Ethernet. 

4.4 REQUIREMENTS 

4.4.1 Mechanical Form Factor 

The MCW-1a form factor conforms to TBD Motorola mechanical requirements. 

4.4.2 Power Requirements 

The MCW-1a requires +5.0 volts from the modem board. The +1.5V to +2.1V core voltage required by the 
MPC7400 is converted from +5.0V on the board. There are two core supplies used to power the four CPU cores. 
The 2.5V voltage required is converted from +5.0V by an onboard power supply. The 3.3V voltage required is also 
converted from +5.0V by an onboard power supply. The MCW-1a estimated typical power dissipation is 50 watts @ 
5.0V. 

4.4.3 Electrical Interface 

The MCW-1a provides a PCI 32-bit, 66 MHz interface to the Motorola modem board via an 80-pin connector. 

The MCW-1a provides two over-the-top RACEway++ ports via two connectors located on the front panel. 

The MCW-1a provides the single Ethernet 10/100 BT interface available from one RJ-45 connector. The Ethernet 
interface is provided by a third party Ethernet-to-PCI interface controller chip that is bridged to the crossbar 
RACEway++ port by means of a PXB++ FPGA (see Figure 2). 

4.4.4 Functional 

1. Shall have the Main SDRAM memory at 133MHz or greater. 

2. Shall have a 1Mbyte L2 Cache at 200MHz or greater. 

3. All CE nodes shall have 128Mbyte of SDRAM. 

4. Host node shall have at least 32Mbytes of nonvolatile memory. 

Form factor requirements: 

5. Shall be a daughter card that is % of a Motorola proprietary form factor modem payload card sized 11" by 14", on 
20mm centers board to board. {Actual shape, dimensions etc. TBD via drawings from Motorola.} 

6. Shall be electrically a PMC module, TBD from further discussions with customer. 

7. Shall use P1, P2 for 32/66MHz PCI bus. 

8. Shall have a maximum heat dissipation of 50W. 

System requirements: 

9. A minimum of 105Mbyte/sec from the modem payload module to the MCW-1a card shall be provided through the 
PCI interface. 

10. Output bandwidth from the MCW-1a card to the Motorola Modem Payload module shall be at least 200kbyte/sec, 
concurrent with the 105Mbyte/sec input. 

11. The system shall have a bandwidth of at least 250Mbyte/sec between CEs, e.g. RACE++ at 66MHz, as a minimum. 

12. Shall have non-volatile memory, for at least 32Mbytes of data. 

13. Shall support software upgrade from remote locations. 

4.5 COMPATIBILITY 

The MCW-1a board is a custom daughter card designed for the Motorola base station modem board. 

4.6 PERFORMANCE 

The PCI bus standard and the PXB++ FPGA limit RACEway++ to PCI performance. Peak transfers of 240 
MB/sec are achievable between the PXB++, PPC8240 and the non-transparent PCI Bridge (see Figure 1). 

Data transfers of up to 266 MB/sec peak are supported for access from RACEway++ to/from the MPC7400 CE's local 
SDRAM memory. 

PCE133 ASIC-initiated DMA transfers run at optimum RACEway++ speeds approaching 266 MB/sec peak. Data can 
be transferred with the DMA from a single DMA command transfer to/from the CN's local SDRAM memory to/from 
RACEway++. The DMA engine formats transfers across RACEway++ optimally using packets up to 2048 Bytes. 

The operating clock frequency of the PCE133 ASIC, SDRAM, and MPC7400 processor bus is 133 MHz. Likewise, the 
operating frequency for the RACEway++ is 66 MHz. The local PCI clock is used by the corresponding PXB++ FPGA 
and does not exceed 33 MHz. 

A separate 25 MHz oscillator is included on the MCW-la for driving the Ethernet interface. 

4.7 DETAILED DESCRIPTION 

The MCW-1a block diagram is shown in Figure 1.

4.7.1 Modem Board Interface 
TBD(PCI 32-bit 66MHz). 
TBD PCI to PCI bridge stuff. 
TBD Motorola requirements. 



4.7.2 Board Resets 



There are several sources of reset to the daughter card. A MAX823 voltage supervisor will generate a 200ms
reset after VCC rises above 4.38 volts. When the MAX823 reset is deasserted, state machine logic will
monitor PCI_RESET_0. The state machine will continue driving RESET_0 until both the MAX823 and
PCI_RESET_0 are deasserted. Either reset will generate the signal RESET_0, which will reset the card into its
power-on state. RESET_0 will also generate the HRESET_0 and TRST signals to the five CPUs. HRESET_0
and TRST for each of the CPUs can also be generated by their JTAG ports: JTAG_HRESET_0 and
JTAG_TRST respectively. The MPC8240 is capable of generating a reset request, a soft reset (C_SRESET_0)
to each CPU, a checkstop request, and a CE ASIC reset (CE_RESET_0) to each of the four CE ASICs. A
discrete from the 5v powered reset PLD will generate the signal NPORESET_1 (not a power on reset). This
signal is fed into the MPC8240's discrete input word. The MPC8240 will read this signal as a logic low only if
it is coming out of reset due to either a power condition or an external reset from offboard. Each node, as well
as the MPC8240, may request a board level reset. These requests are majority voted, and the result
RESETVOTE_0 will generate a board level reset.

Figure 3 shows the MCW-1a hard reset generation function.

Figure 3. HARD RESET FUNCTIONAL BLOCK DIAGRAM

4.7.3 Watchdog Monitor

There are five independent watchdog monitors on the MCW-1a card. Each processor node is responsible for
strobing its watchdog once every 20 msec (initial window after board level reset is 2 sec) but no sooner than
500 usec. Strobing the watchdog for the processing nodes is accomplished by writing a zero/one sequence to
the DIAG3 discrete coming from the PCE133 ASIC. The MPC8240's watchdog is serviced by writing to
the memory mapped discrete location FFFF_D027. A single write of any value will strobe the watchdog. Upon
power-on, the watchdogs come up in a failed state; once a valid strobe is issued, the watchdog will be satisfied.
If the CPU fails to service the watchdog within the valid window, the watchdog will fail. A watchdog of a
failing processing node will trigger an interrupt to the MPC8240. An MPC8240 watchdog fault will trigger a
reset to the board. The watchdog will then remain in a latched failed state until a CPU reset occurs followed by
a valid service sequence. Figure 4 shows valid service sequences of the watchdog.

Figure 4. EXAMPLE WATCHDOG SERVICE SEQUENCES
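
The service window described above can be checked with a small model. The following is only a sketch: the function and constant names are illustrative, timestamps are assumed to be integer microseconds measured from the board level reset, and strobes landing exactly on a limit are assumed to be valid. It examines only the strobes it is given and does not model the latched failed state.

```python
# Timing limits are taken from the text; all names here are hypothetical.
MIN_INTERVAL_US = 500          # a strobe sooner than 500 usec is too early
MAX_INTERVAL_US = 20_000       # normal service window of 20 msec
INITIAL_WINDOW_US = 2_000_000  # first strobe allowed up to 2 sec after reset


def watchdog_failed(strobe_times_us):
    """Return True if the strobe sequence would fail the watchdog.

    strobe_times_us: ascending strobe timestamps in microseconds,
    measured from the board level reset (time 0).
    """
    if not strobe_times_us:
        return True  # never serviced: the watchdog stays in its failed state
    if strobe_times_us[0] > INITIAL_WINDOW_US:
        return True  # missed the initial 2 sec window
    for prev, cur in zip(strobe_times_us, strobe_times_us[1:]):
        interval = cur - prev
        if interval < MIN_INTERVAL_US or interval > MAX_INTERVAL_US:
            return True
    return False
```

For example, strobes at 1.00 s, 1.01 s and 1.02 s pass, while a second strobe only 100 usec after the first is rejected as too early.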



4.7.4 Operating Frequency 

The MPC7400 bus runs at 133 MHz. The L2 cache bus of the MPC7400 runs at 200 MHz to 266 MHz. The SDRAMs
run at 133 MHz. The RACEway++ interface runs at 66 MHz. The local PCI bus runs at 33 MHz and the off board PCI
runs at 66 MHz. The MPC8240's internal frequency is 200 MHz while its SDRAM interface is 100 MHz.

4.7.4.1 Clock Margining 

This card has two crystal oscillators for the three clock domains present on the card. A 66 MHz oscillator serves the
RACEway++ interface and MPC7400 CNs. The 66 MHz frequency is divided in half to generate a 33 MHz signal for
the PCI interface. A second oscillator, 25 MHz, clocks the Ethernet and watchdog circuitry. Both the PCI and MPC
clocks are marginable. In order to provide clock margining, a 4-pin connector allows the test engineer to functionally
disable the onboard oscillator and replace it with a test frequency. The pinout of this connector is detailed in Table 2.

Table 2. Test Clock Connector

Pin   Signal
1     GND
2     /Test Clock
3     Test Clock
4     Test Clock Enable L



4.7.5 Serial Configuration EEPROM 

There are several serial EEPROMs used to load configuration to the CE ASICs, PXB++ and XBAR++ after reset. The
serial PROM functionality can be found in the ASIC's functional specification. 

4.7.5.1 CE ASIC Serial EEPROM 

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during
manufacture of the MCW-1a to contain configuration information for the CE ASIC. The serial EEPROM AT24C128 is
controlled from the CE ASIC. After reset, the CE ASIC automatically reads the first location from the serial EEPROM.
Refer to the CE ASIC functional specification, reference 3, for information on reading and writing this device.



4.7.5.2 PXB++ FPGA Serial EEPROM 
The serial EEPROM can be read and programmed by means of the PCI bus or the RACEway++ bus. It is programmed
during manufacture of the MCW-1a to contain configuration information for the PXB++. The serial EEPROM AT24C128
device is 128K bits and is controlled from the PXB++. After reset, the PXB++ automatically reads 8 KB from the serial
EEPROM and initializes the PXB++ internal registers. Refer to the PXB++ FPGA functional specification, reference 5,
for information on reading and writing this device.

4.7.5.3 XBAR++ ASIC Serial EEPROM

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during
manufacture of the MCW-1a to contain configuration information for the XBAR++. The serial EEPROM AT24C128 is
controlled from the XBAR++ ASIC. After reset, the XBAR++ ASIC automatically reads from the serial EEPROM and
initializes the XBAR++ internal registers. Refer to the XBAR++ ASIC functional specification, reference 4, for
information on reading and writing this device.

4.7.5.3.1 Register Description 

Reference 4 describes the registers of the XBAR++ ASIC.

4.7.6 RACEway++ Interconnect

Communication between all processing and I/O elements on the system card is provided by a Mercury eight-port
crossbar XBAR++ ASIC. The XBAR++ provides up to three simultaneous 266 MB/sec peak throughput data paths
between elements for a total peak throughput of 798 MB/sec. Three crossbar ports connect to the RapidIO Bridge
FPGA. Each MPC7400 CN uses one crossbar port. The Ethernet and MPC8240 interface to a crossbar port through the
PXB++ (See 0). Reference 4 describes the operation and registers of the XBAR++ ASIC.

4.7.7 Local PCI I/O Bus 

The PXB++ FPGA provides the local PCI I/O bus. This bus is accessible by means of the RACEway++ from the
processing nodes. All resources on this bus are initialized and controlled by the MPC8240. This bus provides access to
an Ethernet controller, PCI to PCI transparent bridge and the MPC8240 host controller. Transfers from devices on this
local PCI bus to and from devices on the RACEway++ can achieve 240 MB/sec for writes and 220 MB/sec for reads.
These rates assume block transfers of reasonable size.

4.7.7.1 PXB++ Program EEPROM

The PXB++ FPGA is programmed by an XC18V04 configuration EEPROM running in parallel mode. Configuration
initiates when a power-on or board level reset occurs. Dividing the onboard 33 MHz clock by two generates the
16.6 MHz configuration clock. The configuration EEPROM itself is onboard programmable through the JTAG scan chain.

4.7.7.1.1 Register Description 

Reference 5 describes the registers of the PXB++ ASIC. 



4.7.8 Ethernet Interface 

The PCI-to-Ethernet interface uses the AM79C973 PCnet-FAST III single chip 10/100 Mbps Ethernet controller. This
device is equipped with a built in physical layer interface to achieve a minimal parts count Ethernet interface. A 25 MHz 
oscillator provides the proper clock frequency to the Ethernet chip. The PCI interrupt from the Ethernet chip is wired to 
the MPC8240's external interrupt controller. 



4.7.9 MPC7400 or Nitro Computer Nodes (CNs) 

The board contains four MPC7400 CNs. Each MPC CN uses a PCE133 ASIC to interface the CPU to RACEway++. The
PCE133 ASIC provides all the standard features of a CN, such as a DMA engine, mail box interrupts, timers,
RACEway++ page mapping registers, SDRAM interface, and so on. Local memory for each CN consists of 32, 64, or
128 MB SDRAM, and L2 cache SRAM. Each CN also has a nonvolatile SRAM and watchdog monitor. The CPU bus is
64-bit data, 32-bit address, and operates synchronously at 133 MHz.



4.7.9.1 Processor 

The MCW-la card is designed to use either the 400 MHz MPC7400 or the 400 MHz Nitro processors. The processor is 
packaged in a 25mm, 360-ball CBGA package. Each processor requires the attachment of a heat sink to keep it within 
its thermal limits. 



4.7.9.2 MPC7400 L2 Cache 

The MPC7400 L2 cache for each CN is composed of pipelined, single-cycle deselect, sync burst SRAM. This is 
implemented using two 64K, 128K, or 256K by 36-bit sync burst SRAM parts to make a 0.5 MB, 1 MB, or 2 MB L2 
cache. MPC7400 L2 cache can be depopulated to 0 MB. 
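
The cache sizes quoted above follow from the SRAM geometry. The short calculation below is a sketch under one assumption not stated in the text: each 36-bit-wide part contributes 32 data bits, with the remaining 4 bits used for parity. The function name is illustrative.

```python
def l2_cache_bytes(depth_words, parts=2, data_bits_per_part=32):
    """Usable L2 size in bytes for `parts` SRAMs of `depth_words` entries each.

    Assumes 32 of the 36 bits per part carry data (4 parity bits).
    """
    return depth_words * parts * data_bits_per_part // 8


for depth_k in (64, 128, 256):
    size_mb = l2_cache_bytes(depth_k * 1024) / 2**20
    print(f"two {depth_k}K x 36 parts -> {size_mb} MB")
```

Under that assumption the three populated options come out to 0.5 MB, 1 MB, and 2 MB, matching the figures in the text.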

4.7.9.3 PCE133 ASIC 

The MPC processor compute element ASIC (PCE133 ASIC) is a Mercury-designed component. It provides the 
interface between the MPC7400, the synchronous DRAM, and the RACEway++. All the PCE133 features such as 
DMA, mailbox interrupts, timers, address snooping, prefetch buffers, and so on, are available in this configuration. This 
chip is provided in a 35mm, 388-ball BGA package. Reference 3 describes the operation and registers of the PCE133
ASIC.

4.7.9.3.1 Register Description

Reference 3 describes the registers of the PCE133 ASIC. 

4.7.9.4 Address Map

4.7.9.4.1 Master Address Map

Transfers from the MPC7400 to the PCE133 ASIC and RACEway++ are address mapped as shown in Table 3.
The SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked read/write and locked read
transactions are supported for all data sizes. The 16 Mbyte boot FLASH area is further divided in Table 4.

Table 3. Master Address Map

From Address   To Address    Function
0x0000 0000    0x0FFF FFFF   Local SDRAM 256 MB
0x1000 0000    0x1FFF FFFF   XBAR 256 MB map window 1
0x2000 0000    0x2FFF FFFF   XBAR 256 MB map window 2
0x3000 0000    0x3FFF FFFF   XBAR 256 MB map window 3
0x4000 0000    0x4FFF FFFF   XBAR 256 MB map window 4
0x5000 0000    0x5FFF FFFF   XBAR 256 MB map window 5
0x6000 0000    0x6FFF FFFF   XBAR 256 MB map window 6
0x7000 0000    0x7FFF FFFF   XBAR 256 MB map window 7
0x8000 0000    0x8FFF FFFF   XBAR 256 MB map window 8
0x9000 0000    0x9FFF FFFF   XBAR 256 MB map window 9
0xA000 0000    0xAFFF FFFF   XBAR 256 MB map window A
0xB000 0000    0xBFFF FFFF   XBAR 256 MB map window B
0xC000 0000    0xCFFF FFFF   XBAR 256 MB map window C
0xD000 0000    0xDFFF FFFF   XBAR 256 MB map window D
0xE000 0000    0xEFFF FFFF   XBAR 256 MB map window E
0xF000 0000    0xFBFF FBFF   Not used (CE reg replicated mapping)
0xFBFF FC00    0xFBFF FDFF   Internal CN ASIC registers
0xFBFF FE00    0xFEFF FFFF   Prefetch control
0xFF00 0000    0xFFFF FFFF   16 MB boot FLASH memory area
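
As an illustration of how the master address map partitions the 32-bit space, the sketch below classifies an address into the regions of Table 3. The function name is hypothetical, and the first region is taken to be local SDRAM on the strength of section 4.7.9.9, which places main memory at address 0x0.

```python
def decode_master_address(addr):
    """Return the Table 3 region that a 32-bit MPC7400 address falls into."""
    if not 0 <= addr <= 0xFFFF_FFFF:
        raise ValueError("address must fit in 32 bits")
    if addr <= 0x0FFF_FFFF:
        return "Local SDRAM"
    if addr <= 0xEFFF_FFFF:
        # The top hex digit selects one of the 256 MB XBAR windows 1..E.
        return f"XBAR map window {addr >> 28:X}"
    if addr <= 0xFBFF_FBFF:
        return "Not used (CE reg replicated mapping)"
    if addr <= 0xFBFF_FDFF:
        return "Internal CN ASIC registers"
    if addr <= 0xFEFF_FFFF:
        return "Prefetch control"
    return "Boot FLASH"
```

Because each XBAR window is exactly 256 MB, the window number is simply the upper four address bits, which is why a shift by 28 is enough.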



Table 4. Boot FLASH Address Map

From Address   To Address    Function
0xFF00 2006    0xFF00 2006   Software Fail Register
0xFF00 2005    0xFF00 2005   MPC8240 HA Register
0xFF00 2004    0xFF00 2004   Node 3 HA Register
0xFF00 2003    0xFF00 2003   Node 2 HA Register
0xFF00 2002    0xFF00 2002   Node 1 HA Register
0xFF00 2001    0xFF00 2001   Node 0 HA Register
0xFF00 2000    0xFF00 2000   Local HA Register (status/control)
0xFF00 0000    0xFF00 1FFF   NOVRAM



4.7.9.4.2 Slave Address Map 

Slave accesses are defined as accesses initiated by an external RACEway++ device directed toward the MPC7400 CN.
The MPC is not accessible as a slave device. The SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked
read/write and locked read are supported for all data sizes. The PCE RACEway port supports a 256 MB address space
partitioned as follows in Table 5:

Table 5. Slave Address Map

From Address   To Address    Function
0x0000 0000    0x0FFF FBFF   256 MB less 1 KB hole SDRAM
0x0FFF FC00    0x0FFF FFFF   PCE133 internal registers



4.7.9.5 Interrupt 

Reference 3 describes the internal interrupt sources for the PCE133 ASIC. The external interrupt pin on the PCE133
ASIC is driven by the HA PLD and is currently not used. The interrupt output from the PCE133 ASIC is wired to the
CPU's external interrupt input pin.

4.7.9.6 PCE133 DIAG Bits

The DIAG3 signal is wired to the HA PLD and is used to strobe the node's hardware watchdog monitor. The DIAG2
signal is wired to the MPC8240's interrupt controller and is used, by the node, to generate a general purpose interrupt to
the MPC8240. The DIAGBIT signal is wired to the HA PLD and is currently not used.

4.7.9.7 MPC7400 Reset 

The MPC7400 hard reset signal is driven by three sources gated together: the HRESET_0 pin on the PCE133 ASIC,
HRESET_0 from the JTAG connector, and HRESET_0 from the majority voter. The HRESET_0 pin from the CE ASIC
is set by the "node run" bit field (bit 0) of the PCE133 ASIC's Miscon_A register. Setting HRESET_0 low causes the
MPC7400 to be held in reset. HRESET_0 is low immediately after system reset or power-up; the MPC7400 is held in
reset until the HRESET_0 line is pulled high by setting the node run bit to 1. The JTAG HRESET_0 is controlled by
debugger software when a JTAG debugger module is connected to the card. The HRESET_0 from the majority voter is
generated by a majority vote from all healthy nodes to reset.

4.7.9.8 Boot Procedures 

When a CPU reset is asserted, the MPC7400 is put into reset state. The MPC7400 will remain in a reset state until the
RUN bit 0 of the Miscon_A register is set to 1 and the MPC8240 has released the reset signals in the discrete output
word. The RUN bit should be set to 1 after the boot code has been loaded into the SDRAM starting at location
0x0000_0100. The ASIC maps the reset vector 0xFFF0_0100 generated by the MPC7400 to address 0x0000_0100.

4.7.9.9 MPC7400 CN SDRAM 

The main memory for each CN is composed of one bank of synchronous DRAM. This is implemented using five
K4S280832A-TC/L75 @133 MHz synchronous DRAM parts. As shown in the memory map (See Table 3), the main
memory begins at address 0x0 and grows upward in the address space as memory is increased. The PCE133 ASIC
supports error correction (ECC) on the SDRAM.

The SDRAM operates as zero wait state memory and can provide up to 1 GB/sec peak bandwidth on writes from 
MPC7400 and 800 MB/sec peak bandwidth on read from the MPC7400. ECC error correction is supported. 

4.7.9.10 MPC7400 Non-Volatile RAM 

Each node will be equipped with 8Kx8 of non-volatile RAM for the storage of fault record data and configuration
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the PCE133 ASIC's
boot FLASH interface. The data bus of the device is isolated from the PCE ASIC through an IDT IDTQS32244SO
buffer. This buffer provides loading isolation and 3.3v to 5v translation.

4.7.10 MPC8240 Host Controller

The MPC8240 integrated processor is comprised of a peripheral logic block and a 32-bit embedded MPC603e PowerPC 
processor core. The peripheral logic integrates a PCI bridge, memory controller, DMA controller, EPIC interrupt 
controller, a message unit, and an I2C controller. The processor core is a full featured, high-performance processor with 
floating-point support, memory management, 16Kbytes instruction cache, 16Kbytes data cache, and power management 
features. 



Major features of the MPC8240 are as follows: 
Peripheral logic 

- Memory interface 

High-bandwidth bus, 64-bit data bus, to SDRAM. 
ECC Protected SDRAM 
16 Mbytes of ROM space (32Mbytes paged). 
8-bit ROM. 

Write buffering for PCI and processor accesses. 

- PCI Interface 

32-bit PCI interface operating at 33 MHz (66 MHz capable). 
PCI 2.1-compatible.

Support for accesses to all PCI address spaces.
Selectable big- or little-endian operation.

Store gathering of processor-to-PCI write and PCI-to-memory write accesses.
PCI bus arbitration unit (five request/grant pairs).

- Two-channel integrated DMA controller

Supports direct mode or chaining mode (automatic linking of DMA transfers).
Supports scatter gathering: read or write discontinuous memory.
Interrupt on completed segment, chain, and error.
Local-to-local memory.
PCI-to-PCI memory.
PCI-to-local memory.
Local-to-PCI memory.
- Message unit

Two doorbell registers.
Inbound and outbound messaging registers.
I2O message controller.
- I2C controller with full master/slave support

- Embedded programmable interrupt controller (EPIC) 

Five hardware interrupts (IRQs) or 16 serial interrupts. 
Four programmable timers. 

- Integrated PCI bus and SDRAM clock generation 

- Programmable memory and PCI bus output drivers 

- Debug features 

Memory attribute and PCI attribute signals. 
Debug address signals. 

MIV signal: Marks valid address and data bus cycles on the memory bus.
Error injection/capture on data path. 
IEEE 1149.1 (JTAG)/test interface. 
Processor core 

- High-performance, superscalar processor core 

Integer unit (IU).

Floating-point unit (FPU) (user enabled or disabled).
Load/store unit (LSU).
System register unit (SRU).
Branch processing unit (BPU).

- 16-Kbyte instruction cache

- 16-Kbyte data cache

- Lockable L1 cache - entire cache or on a per-way basis

- Dynamic power management 

4.7.10.1 Address Map 

The MPC8240 in PCI host mode supports two address mapping configurations designated as address map A, and 
address map B. Address map A conforms to the PowerPC reference platform (PReP) specification. Address map B 
conforms to the PowerPC microprocessor common hardware reference platform (CHRP). Note that the support of map 
A is provided for backward compatibility only. It is strongly recommended that new designs use map B because map A 
may not be supported in future devices. 

Address map B complies with the PowerPC microprocessor common hardware reference platform (CHRP). The address 
space of map B is divided into four areas: system memory, PCI memory, PCI I/O, and system ROM space. When 
configured for map B, the MPC8240 translates addresses across the internal peripheral logic bus and the external PCI 
bus as shown in Table 6. 



Table 6. MPC8240 Address Map B

Processor Core Address Range                   PCI Address Range        Definition
Hex                     Decimal
0000_0000   0009_FFFF   0        640K-1        NO PCI CYCLE             System memory
000A_0000   000F_FFFF   640K     1M-1          000A_0000 - 000F_FFFF    Compatibility hole
0010_0000   3FFF_FFFF   1M       1G-1          NO PCI CYCLE             System memory
4000_0000   7FFF_FFFF   1G       2G-1          NO PCI CYCLE             Reserved
8000_0000   FCFF_FFFF   2G       4G-48M-1      8000_0000 - FCFF_FFFF    PCI memory
FD00_0000   FDFF_FFFF   4G-48M   4G-32M-1      0000_0000 - 00FF_FFFF    PCI/ISA memory
FE00_0000   FE7F_FFFF   4G-32M   4G-24M-1      0000_0000 - 007F_FFFF    PCI/ISA I/O
FE80_0000   FEBF_FFFF   4G-24M   4G-20M-1      0080_0000 - 00BF_FFFF    PCI I/O
FEC0_0000   FEDF_FFFF   4G-20M   4G-18M-1      CONFIG_ADDR              PCI configuration address
FEE0_0000   FEEF_FFFF   4G-18M   4G-17M-1      CONFIG_DATA              PCI configuration data
FEF0_0000   FEFF_FFFF   4G-17M   4G-16M-1      FEF0_0000 - FEFF_FFFF    PCI interrupt acknowledge
FF00_0000   FF7F_FFFF   4G-16M   4G-8M-1       FF00_0000 - FF7F_FFFF    32/64-bit FLASH/ROM (1)
FF80_0000   FFFF_FFFF   4G-8M    4G-1          FF80_0000 - FFFF_FFFF    8/32/64-bit FLASH/ROM (2)



Notes: 

(1) This bank of FLASH is not used. 

(2) This bank of FLASH is configured in 8-bit mode and is further broken down in Table 7. 
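
The translation in Table 6 can be expressed as a lookup. The sketch below is illustrative only: the names `MAP_B` and `translate_map_b` are hypothetical, and rows whose PCI column is not a linear address range (NO PCI CYCLE, CONFIG_ADDR, CONFIG_DATA) return `None` for the PCI address.

```python
# (first, last, definition, PCI base address or None), following Table 6.
MAP_B = [
    (0x0000_0000, 0x0009_FFFF, "System memory", None),
    (0x000A_0000, 0x000F_FFFF, "Compatibility hole", 0x000A_0000),
    (0x0010_0000, 0x3FFF_FFFF, "System memory", None),
    (0x4000_0000, 0x7FFF_FFFF, "Reserved", None),
    (0x8000_0000, 0xFCFF_FFFF, "PCI memory", 0x8000_0000),
    (0xFD00_0000, 0xFDFF_FFFF, "PCI/ISA memory", 0x0000_0000),
    (0xFE00_0000, 0xFE7F_FFFF, "PCI/ISA I/O", 0x0000_0000),
    (0xFE80_0000, 0xFEBF_FFFF, "PCI I/O", 0x0080_0000),
    (0xFEC0_0000, 0xFEDF_FFFF, "PCI configuration address", None),
    (0xFEE0_0000, 0xFEEF_FFFF, "PCI configuration data", None),
    (0xFEF0_0000, 0xFEFF_FFFF, "PCI interrupt acknowledge", 0xFEF0_0000),
    (0xFF00_0000, 0xFF7F_FFFF, "FLASH/ROM (1)", 0xFF00_0000),
    (0xFF80_0000, 0xFFFF_FFFF, "FLASH/ROM (2)", 0xFF80_0000),
]


def translate_map_b(addr):
    """Return (definition, pci_address_or_None) for a processor core address."""
    for first, last, definition, pci_base in MAP_B:
        if first <= addr <= last:
            pci = None if pci_base is None else pci_base + (addr - first)
            return definition, pci
    raise ValueError("address must fit in 32 bits")
```

For instance, a processor access to FD00_1000 lands in the PCI/ISA memory region and appears on the PCI bus at 0000_1000.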



Table 7. Port X Address Map

Bank Select   Processor Core Address Range   Definition
11111         FFE0_0000 - FFEF_FFFF          (illegible)
(illegible)   FFE0_0000 - FFEF_FFFF          (illegible)
00000         FFE0_0000 - FFEF_FFFF          (illegible)
              (illegible)                    Application/boot code
              FFFF_D000 - FFFF_D000          Discrete input word 0
              FFFF_D001 - FFFF_D001          Discrete input word 1
              FFFF_D002 - FFFF_D002          Discrete output word 0
              FFFF_D003 - FFFF_D003          Discrete output word 1
              FFFF_D004 - FFFF_D004          Discrete output word 2
              FFFF_D010 - FFFF_D010          IC (Pending interrupt)
              FFFF_D011 - FFFF_D011          IC (Interrupt mask low)
              FFFF_D012 - FFFF_D012          IC (Interrupt clear low)
              FFFF_D013 - FFFF_D013          IC (Unmasked, pending low)
              FFFF_D014 - FFFF_D014          IC (Interrupt input low)
              FFFF_D015 - FFFF_D015          Unused (read FF)
              FFFF_D016 - FFFF_D016          Unused (read FF)
              FFFF_D017 - FFFF_D017          Unused (read FF)
              FFFF_D018 - FFFF_D018          Unused (read FF)
              FFFF_D019 - FFFF_D019          Unused (read FF)
              FFFF_D020 - FFFF_D020          HA (Local HA register)
              FFFF_D021 - FFFF_D021          HA (Node 0 HA register)
              FFFF_D022 - FFFF_D022          HA (Node 1 HA register)
              FFFF_D023 - FFFF_D023          HA (Node 2 HA register)
              FFFF_D024 - FFFF_D024          HA (Node 3 HA register)
              FFFF_D025 - FFFF_D025          HA (8240 HA register)
              FFFF_D026 - FFFF_D026          HA (Software Fail)
              FFFF_D027 - FFFF_D027          HA (Watchdog Strobe)
              FFFF_D028 - FFFF_DFFF          4068 Bytes FLASH
              FFFF_E000 - FFFF_FFFF          8K NOVRAM



Notes: 

(1) Thirty-one 1Mbyte blocks of application memory residing at address FFE0_0000 - FFEF_FFFF selected by the
FLASH page bits. 

(2) 2Mbyte block available after reset. 

(3) Always available 



4.7.10.2 Register Description

Reference 10 describes the registers of the MPC8240.



4.7.10.3 Interrupt 

The MPC8240 contains an embedded programmable interrupt controller (EPIC) device. The EPIC implements the
necessary functions to provide a flexible and general-purpose interrupt controller solution. The EPIC pools
hardware-generated interrupts from many sources, both within the MPC8240 and externally, and delivers them to the
processor core in a prioritized manner. The solution adopts the OpenPIC architecture (architecture developed jointly by
AMD and Cyrix for SMP interrupt solutions) and implements the logic and programming structures according to that
specification. The MPC8240's EPIC unit supports up to five external interrupts, four internal logic-driven interrupts and
four timers with interrupts. See Reference 10 for a detailed description of the EPIC unit.

The five external interrupt inputs to the EPIC are wired to the external interrupt controller PLD. 

4.7.10.4 MPC8240 Reset 

The MPC8240 can be reset from three sources: a board level reset (RESET_0), JTAG controlled reset, or a failure in
its watchdog monitor. Any reset to the MPC8240 shall cause the discrete output registers to reset to the (low) state. This,
in turn, will cause all G4 nodes to enter the reset state.

4.7.10.5 Boot Procedure

After the release of reset to the MPC8240, it will begin executing code out of the FLASH memory. A reset will
automatically set the FLASHSEL(4:0) bits to all zeros; therefore, the MPC8240's boot code must reside in bank 0.
Once its application code is copied to SDRAM, the MPC8240 can then sequence through the FLASH banks by setting
the appropriate bits in the discrete output word. Application code for the G4 nodes resides in the remaining thirty-one
banks of FLASH.



4.7.11 Bulk FLASH Memory 

There are 32Mbytes of bulk FLASH memory, comprised of two Intel 28F128J3 StrataFLASH memory devices. The
MPC8240's memory map limits the size of the 8-bit wide FLASH to 2Mbytes; this requires hardware to divide the
FLASH into thirty-two 1Mbyte banks. Five software-controlled discretes allow switching between banks. Accesses to
the 1Mbyte address range of FFE0_0000 through FFEF_FFFF will always access the first block of FLASH,
NOVRAM, Discrete I/O, HA registers, watchdog monitor, and the interrupt controller. Accesses to the 1Mbyte address
range of FFF0_0000 through FFFF_FFFF will access a page of memory in the FLASH. The actual page selected is
based on the five FLASH select bits, driven by the Discrete Output word.
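
The bank switching described above amounts to updating five bits of a discrete output word. The following sketch makes two assumptions that the text does not confirm: that DH(0:7) uses the PowerPC convention in which DH0 is the most significant bit, so FLASHSEL4..FLASHSEL0 on DH(3:7) occupy the low five bits of the byte, and that banks map linearly onto the 32 Mbyte FLASH array. The function names are hypothetical.

```python
FLASHSEL_MASK = 0x1F  # FLASHSEL4..FLASHSEL0, assumed to be the low five bits
BANK_SIZE = 1 << 20   # each bank is 1 Mbyte


def select_flash_bank(output_word, bank):
    """Return the discrete output byte with FLASHSEL(4:0) set to `bank`.

    The upper three bits of the byte are preserved unchanged.
    """
    if not 0 <= bank <= 31:
        raise ValueError("bank must be in 0..31")
    return (output_word & ~FLASHSEL_MASK & 0xFF) | bank


def flash_device_offset(bank, window_offset):
    """Offset into the FLASH array, assuming banks are laid out linearly."""
    if not 0 <= window_offset < BANK_SIZE:
        raise ValueError("offset must lie within one 1 Mbyte bank")
    return bank * BANK_SIZE + window_offset
```

A read-modify-write of this kind keeps the other discretes in the same word from being disturbed when the bank changes.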

4.7.12 Real Time Clock 

The PCF8563 is a CMOS real-time clock/calendar optimized for low power consumption. A programmable clock 
output, interrupt output and voltage-low detector are also provided. All addresses and data are transferred serially via a 
two-line bidirectional I2C-bus. Maximum bus speed is 400 kbits/s.

Real Time Clock Features: 

- Provides year, month, day, weekday, hours, minutes and seconds 

(Based on an external 32.768 kHz quartz crystal) 

- Century flag 

- Wide operating supply voltage range: 1.0 to 5.5 V

- Low back-up current; typical 0.25 µA at VDD = 3.0 V and Tamb = 25 °C

- 400 kHz two-wire I2C-bus interface (at VDD = 1.8 to 5.5 V)

- Programmable clock output for peripheral devices: 32.768 kHz, 1024 Hz, 32 Hz and 1 Hz 

- Alarm and timer functions 

- Voltage-low detector 

- Integrated oscillator capacitor 

- Internal power-on reset 

- I2C-bus slave address: read A3H; write A2H

- Open drain interrupt pin 

4.7.13 Nonvolatile Memory 

The MPC8240 will be equipped with 8Kx8 of non-volatile RAM for the storage of fault record data and configuration
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the local bus
interface. The device's data bus is isolated from the local bus through an IDT IDTQS32244SO buffer. This buffer
provides 3.3v to 5v translation.

4.7.14 Fault Status and Control Registers 

The MPC8240 has access to five 8-bit status registers. One register represents its own status while the others represent
the fault status of the other four G4 CPUs. Each register has the identical format, as shown in Table 8.
These five registers grant the MPC8240 status information from each node on the board, without going through the
RACEway fabric.

The MPC8240 will have one 8-bit Fault control register. The control register for each CPU will have the following 
format as shown in Table 9: 



Bit   Name                  Description
0     CHECKSTOP_OUT         Checkstop state of CPU (0 = CPU in checkstop)
1     WDM_FAULT             WDM failed (0 = WDM failed, set high after reset and valid service)
2     SOFTWARE_FAULT        Software fault detected (Set to 0 when a software exception was detected) (R/W local)
3     RESETREQ_IN           Wrap status of the local CPU's reset request
4     WDM_INIT              WDM failed in initial 2 second window (0 = WDM failed)
5     Software definable 0  Software definable 0
6     Software definable 1  Software definable 1
7     unused                unused

Table 8. Fault Status Register Format



Bit   Name                  Description
0     RESETREQ_OUT_0        Request a reset event (0 => forces reset)
1     CHKSTOPOUT_0          Request that node 0 enter checkstop state (0 => request checkstop)
2     CHKSTOPOUT_1          Request that node 1 enter checkstop state (0 => request checkstop)
3     CHKSTOPOUT_2          Request that node 2 enter checkstop state (0 => request checkstop)
4     CHKSTOPOUT_3          Request that node 3 enter checkstop state (0 => request checkstop)
5     CHKSTOPOUT_8240       Request that the MPC8240 enter checkstop state (0 => request checkstop)
6     Software definable 0  Software definable 0
7     Software definable 1  Software definable 1

Table 9. Fault Control Register Definition
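
Since the fault status bits are active low, software reading one of the five status registers has to look for zeros rather than ones. The sketch below shows one way to decode a status byte. It assumes that "bit 0" in Table 8 is the least significant bit of the byte, which the text does not state, and the names `ACTIVE_LOW_FAULTS` and `active_faults` are hypothetical. RESETREQ_IN is omitted because its polarity is not given.

```python
# Bit positions from Table 8, assuming bit 0 is the least significant bit.
ACTIVE_LOW_FAULTS = {
    0: "CHECKSTOP_OUT",
    1: "WDM_FAULT",
    2: "SOFTWARE_FAULT",
    4: "WDM_INIT",
}


def active_faults(status):
    """Return the names of the fault bits that read as 0 in a status byte."""
    return [name for bit, name in sorted(ACTIVE_LOW_FAULTS.items())
            if not (status >> bit) & 1]
```

A value of 0xFF therefore reports no faults, and 0xFE reports only that the CPU is in checkstop.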



4.7.15 Majority Voter 

There are two different functions controlled by majority voters. The first is local to each CPU; this voter controls the
assertion of CHECKSTOP_IN to the CPU. The second voter is centralized to the board; it will control the master reset
to the board. Both voters shall follow the same set of rules: The output will follow the majority of non-checkstopped
CPUs. A 1-on-1 or 2-on-2 condition in either voter will result in a board level reset.

4.7.16 Discrete I/O
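
The voting rules can be sketched in a few lines. This is a behavioral model only, not the PLD logic: the function name and return shape are illustrative, and the handling of the case where every CPU is checkstopped (no output, no reset) is an assumption.

```python
def majority_vote(requests, checkstopped):
    """Model one majority voter.

    requests:     per-CPU booleans, True if that CPU requests the action.
    checkstopped: per-CPU booleans, True if that CPU is in checkstop
                  (its vote is ignored).
    Returns (output, board_reset): the voter output, and whether a tie
    among the voting CPUs forces a board level reset.
    """
    votes = [req for req, stopped in zip(requests, checkstopped) if not stopped]
    yes = sum(votes)
    no = len(votes) - yes
    if votes and yes == no:
        return False, True  # 1-on-1 or 2-on-2 tie: board level reset
    return yes > no, False
```

With all five CPUs healthy a tie cannot occur; it arises only after one or three CPUs have been checkstopped, which is the situation the tie rule covers.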

There are 16 discrete output signals directly controllable and readable by the MPC8240. The 16 discretes are divided up 
into two addressable 8-bit words. Writing to a discrete output register will cause the upper 8-bits of the data bus to be 
written to the discrete output latch. Reading a discrete output register will drive the 8-bit discrete output onto the upper 
8-bits of the MPC8240's data bus. Table 10 defines the bits in the discrete output word.

There are 16 discrete input signals accessible by the MPC8240. Reads from the discrete input address space will latch 
the state of the signals, and return the latched state of the discretes to the MPC8240. Table 11 defines the bits in the 
discrete input word. 
Table 10. Discrete Output Words

Word 2
DH(0:7)   Signal           Description
0         ND0_FLASH_EN_1   Enable the CE ASIC's FLASH port when 1
1         ND1_FLASH_EN_1   Enable the CE ASIC's FLASH port when 1
2         ND2_FLASH_EN_1   Enable the CE ASIC's FLASH port when 1
3         ND3_FLASH_EN_1   Enable the CE ASIC's FLASH port when 1
4         WRAP1            Wrap to discrete input
5
6
7

Word 1
DH(0:7)   Signal           Description
0         WRAP0            Wrap to Discrete Input
1         I2C_RESET_0      Reset the I2C serial bus when 0
2         SWLED            Software controlled LED
3         FLASHSEL4        Flash bank select address bit 4
4         FLASHSEL3        Flash bank select address bit 3
5         FLASHSEL2        Flash bank select address bit 2
6         FLASHSEL1        Flash bank select address bit 1
7         FLASHSEL0        Flash bank select address bit 0

Word 0
DH(0:7)   Signal           Description
0         C_SRESET3_0      Issue a Soft Reset to CPU on Node 3 when 0
1         C_PRESET3_0      Reset PCE133 ASIC Node 3 when 0
2         C_SRESET2_0      Issue a Soft Reset to CPU on Node 2 when 0
3         C_PRESET2_0      Reset PCE133 ASIC Node 2 when 0
4         C_SRESET1_0      Issue a Soft Reset to CPU on Node 1 when 0
5         C_PRESET1_0      Reset PCE133 ASIC Node 1 when 0
6         C_SRESET0_0      Issue a Soft Reset to CPU on Node 0 when 0
7         C_PRESET0_0      Reset PCE133 ASIC Node 0 when 0



Table 11. Discrete Input Words

Word 1
DH(0:7)   Signal            Description
0         WRAP1             Wrap from discrete output word
1         TBD
2         V3.3_FAIL_0       Latched status of power supply since last reset
3         V2.5_FAIL_0       Latched status of power supply since last reset
4         VCORE1_FAIL_0     Latched status of power supply since last reset
5         VCORE0_FAIL_0     Latched status of power supply since last reset
6         RIOR_CNF_DONE_1   RIO/RACE++ FPGA configuration complete
7         PXB0_CNF_DONE_1   PXB++ FPGA configuration complete

Word 0
DH(0:7)   Signal            Description
0         WRAP0             Wrap from discrete output word
1         WDMSTATUS         MPC8240's watchdog monitor status (0 = failed)
2, 3      NPORESET_1        Not a power on reset when high
4
5
6
7






4.7.17 Interrupt Controller 

The MPC8240 interfaces with an 8-input interrupt controller that is external to the MPC8240 itself. The interrupt inputs are
wired through the controller to interrupt zero of the MPC8240 external interrupt inputs. The remaining four MPC8240
interrupt inputs are unused.

The Interrupt Controller comprises the following five 8-bit registers:

Pending Register - A low bit indicates a falling edge was detected on that interrupt (read only) 

Clear Register - Setting a bit low will clear the corresponding latched interrupt (write only) 

Mask Register - Setting a bit low will mask the pending interrupt from generating an MPC8240 interrupt 

Unmasked Pending Register - A low bit indicates a pending interrupt that is not masked out 

Interrupt State Register - indicates the actual logic level of each interrupt input pin 



4.7.17.1 Interrupt Controller Operation 

Table 12 lists the interrupt input sources and their bit positions within each of the five registers. A falling edge on an
interrupt input will set the appropriate bit in the pending register low. The pending register is gated with the mask
register, and any unmasked pending interrupts will activate the interrupt output signal to the MPC8240's external
interrupt input pin. Software will then read the unmasked pending register to determine which interrupt(s) caused the
exception. Software can then clear the interrupt(s) by writing a zero to the corresponding bit in the clear register. If
multiple interrupts are pending, the software has the option of either servicing all pending interrupts at once and then
clearing the pending register, or servicing the highest priority interrupt (software priority scheme) and then clearing that
single interrupt. If more interrupts are still latched, the interrupt controller will generate a second interrupt to the
MPC8240 for software to service. This will continue until all interrupts have been serviced. An interrupt that is masked
will show up in the pending register but not in the unmasked pending register and will not generate an MPC8240
interrupt. If the mask is then cleared, that pending interrupt will flow through the unmasked pending register and
generate an MPC8240 interrupt.
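
The register behavior described above can be sketched as a small software model. This is an illustration only, not the board's driver: the struct, the function names, the mapping of bit n to (1u << n), and the lowest-bit-first priority are all assumptions made for the example. It models the active-low pending, mask, clear, and unmasked pending registers and the "service one interrupt, then clear it" loop.

```c
#include <stdint.h>

/* Software model of the controller, for illustration only (no hardware access).
 * Assumption: bit n of each register corresponds to (1u << n). */
typedef struct {
    uint8_t pending; /* 0 = falling edge latched on that input */
    uint8_t mask;    /* 0 = input masked */
} irq_ctrl;

static void irq_init(irq_ctrl *c)
{
    c->pending = 0xFF; /* nothing latched */
    c->mask = 0xFF;    /* nothing masked */
}

/* A falling edge on an input sets its pending bit low. */
static void irq_falling_edge(irq_ctrl *c, unsigned bit)
{
    c->pending &= (uint8_t)~(1u << bit);
}

/* Unmasked pending register: low only where pending is low and mask is high. */
static uint8_t irq_unmasked_pending(const irq_ctrl *c)
{
    return (uint8_t)(c->pending | (uint8_t)~c->mask);
}

/* Clear register write: each 0 bit clears the corresponding latched interrupt. */
static void irq_write_clear(irq_ctrl *c, uint8_t value)
{
    c->pending |= (uint8_t)~value;
}

/* Service one interrupt at a time, lowest bit first (a hypothetical software
 * priority scheme). Records serviced bits in order[]; returns the count. */
static int irq_service_all(irq_ctrl *c, int *order, int max)
{
    int n = 0;
    uint8_t up;

    while (n < max && (up = irq_unmasked_pending(c)) != 0xFF) {
        for (unsigned bit = 0; bit < 8; bit++) {
            if ((up & (1u << bit)) == 0) {
                order[n++] = (int)bit;
                irq_write_clear(c, (uint8_t)~(1u << bit));
                break;
            }
        }
    }
    return n;
}
```

In this model a masked interrupt stays latched in the pending register and is only serviced once its mask bit is set high again, which matches the last two sentences of the description.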



Table 12. Interrupt Controller Inputs 



Bit  Signal          Description
0    SWFAIL_0        8240 Software Controlled Fail Discrete
1    RTC_INT_0       Real time clock event
2    NODE0_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
3    NODE1_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
4    NODE2_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
5    NODE3_FAIL_0    WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
6    PCI_INT_0       PCI interrupt
7    XB_SYS_ERR_0    XBAR internal error
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4.7.18 Configuration Jumpers 

J18-1 - J18-2, the watchdog monitor mask, when installed, will mask all watchdog failures.

J18-3 - J18-4, the serial EEPROM's write enable jumper, when installed, enables modification of the serial EEPROMs.
J18-5 - J18-6, the flash write-protect jumper, when installed, prevents modification of any flash memory location.
J18-7 - J18-8, the PXB0 use PROM jumper, when installed, will enable the PXB0's serial configuration PROM.

4.7.19 LEDs 

There are nine LEDs, visible at the top of the board. 



LD1 is a software controlled LED 
LD2 is a software controlled LED 
LD3 is the Node 0 watchdog fail LED 
LD4 is the Node 1 watchdog fail LED 
LD5 is the Node 2 watchdog fail LED 
LD6 is the Node 3 watchdog fail LED 
LD7 is the MPC8240 watchdog fail LED 
LD8 indicates the state of the board level reset 
LD9 indicates a XBAR system error. 

There are two additional LEDs, located on the Ethernet connector, that indicate Ethernet status.



4.7.20 Power Supply 

The MCW-1a board requires 3.3V, 2.5V, and 1.8V. There are two 1.8V supplies; each drives the core voltage for two
CPUs. To provide power to the MCW-1a, the three voltages must have separate switching supplies, and proper power
sequencing to the device must be provided. All three voltages are converted from 5.0V. The power to the daughter card
is provided directly from the modem board.

4.7.20.1 MPC7400 Core Power Supply 

There are two core voltage power supplies; each one is dedicated to two MPC7400 PPC cores. The core voltage can be
in the 2.2V to 1.5V range. This power supply is rated at 12 A in the range from 2.2V to 1.5V.

4.7.20.2 Main 3.3V Power Supply

A 3.3V power supply is used to provide power to the SBSRAM core, SDRAM, SCSI, PXB++, and XBAR++ PCE133
I/O. This power supply is rated at TBD Amp.

4.7.20.3 Core and I/O 2.5V Power Supply

A 2.5V power supply is used to provide power to the PCE133 and can also power the PXB++ FPGA core. The 
MPC7400 processor bus can run at 2.5V signaling. The MPC7400 L2 bus can operate at 2.5V signaling. This 2.5V 
power supply is rated at TBD Amp. 

4.7.20.4 ASICs Power Supplies Tolerance Requirements

SBSRAM VDD = 3.3V +0.165V/-0.165V power supply
SBSRAM VDDQ = 3.3V +0.165V/-0.165V for 3.3V I/O or 2.5V +0.4V/-0.125V for 2.5V I/O
SDRAM VDD = 3.3V +0.3V/-0.3V power supply
XBAR++ VDD = 3.3V +0.3V/-0.3V power supply
PCE133 VDD = 2.5V +?V/-?V power supply
PCE133 VDD33 = 3.3V +?V/-?V power supply

4.7.20.5 Power Supply Voltage Sequencing

Power sequencing is very important in multivoltage digital boards. It is necessary for long-term reliability. The right
power supply sequencing can be accomplished by using power-good and inhibit signals. To provide fail-safe operation
of the device, power should be supplied so that if the core supply fails during operation, the I/O supply is shut down as
well.

The general rule is to ramp all power supplies up and down at the same time. This is shown in Figure 5. In reality, ramp
up and down depend on multiple factors: the power supply, the total board capacitance that needs to be charged, the power
supply load, and so on. Figure 6 shows ideal worst-case sequencing for ramp up and down that is performed by the protection
sequencing circuits shown in Figure 7. This circuit keeps the voltage difference within the required range.
The MPC7400 requires the core supply to not exceed the I/O supply by more than 0.4 volts at all times. Also, the I/O
supply must not exceed the core supply by more than 2 volts.
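
The two MPC7400 limits stated above can be checked against sampled ramp data with a few lines of C. This is a minimal sketch: the function names and the idea of checking paired voltage samples are assumptions for illustration; only the 0.4V and 2V limits come from the text.

```c
#include <stdbool.h>

#define CORE_OVER_IO_MAX_V 0.4 /* core may exceed I/O by at most 0.4V */
#define IO_OVER_CORE_MAX_V 2.0 /* I/O may exceed core by at most 2V */

/* True when one pair of rail voltages satisfies both limits. */
static bool rails_ok(double v_core, double v_io)
{
    return (v_core - v_io) <= CORE_OVER_IO_MAX_V &&
           (v_io - v_core) <= IO_OVER_CORE_MAX_V;
}

/* Scan paired samples of a ramp (hypothetical measurement data).
 * Returns the index of the first violating sample, or -1 if none. */
static int first_violation(const double *core, const double *io, int n)
{
    for (int i = 0; i < n; i++) {
        if (!rails_ok(core[i], io[i]))
            return i;
    }
    return -1;
}
```

For example, a 3.3V I/O rail that is fully up while the 1.8V core rail is still at 0V violates the second limit, which is the condition the diode circuit of Figure 7 is meant to prevent.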



[Plot: supply voltage (Volt) versus time]

Figure 5. IDEAL POWER SUPPLY SEQUENCING

[Plot: supply voltage versus time]

Figure 6. REAL POWER SUPPLY SEQUENCING

[Schematic: diodes D1 to D9 connected between the 3.3V, 2.5V, 1.8V_1, and 1.8V_2 supplies]

Figure 7. VOLTAGE SEQUENCING CIRCUITS

A voltage of 0.7V drops across each diode.
During power up sequencing:

D1 and D2 provide the ramp up voltage for the 2.5V power supply as soon as the 3.3V power supply reaches 1.4V.
D3 and D4 provide the ramp up voltage for the 1.8V_1 power supply as soon as the 2.5V power supply reaches 1.4V.
D7 and D8 provide the ramp up voltage for the 1.8V_2 power supply as soon as the 2.5V power supply reaches 1.4V.

During power down sequencing:

D5 provides the ramp down for the 2.5V power supply as soon as the 3.3V power supply reaches 1.8V.
D6 provides the ramp down for the 1.8V_1 power supply as soon as the 2.5V power supply reaches 1.1V.
D9 provides the ramp down for the 1.8V_2 power supply as soon as the 2.5V power supply reaches 1.1V.

The 3.3V power supply is connected to the VCC3P3 power plane.
The 2.5V power supply is connected to the VCC2P5 power plane.
The 1.8V_1 power supply is connected to the VCC1P8_1 power plane.
The 1.8V_2 power supply is connected to the VCC1P8_2 power plane.



4.7.20.6 Power Supply Monitoring 

A PLD is used to monitor the voltage status signals from the onboard supplies. It is powered up from +5V and monitors 
+3.3V, +2.5V, +1.8V_1, and +1.8V_2. This circuit monitors the power_good signals from each supply. In the case of a
power failure in one or more supplies, the PLD will issue a restart to all supplies and a board level reset to the daughter 
card. A latched power status signal will be available from each supply as part of the discrete input word. The latched 
discrete shall indicate any power fault condition since the last off-board reset condition. 
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5 ELECTRICAL INTERFACE 
5.1.1 Power Consumption 



Table 13. MCW-1a CN Power Consumption

Description  Qty  Total Typ. Power  Total Max. Pwr.
CE ASIC      1    1W                1.5W
SDRAM        5    3W                3.5W
SBSRAM       2    1.2W              2.5W
G4           1    8W                12W
Oscillator   1    0.1W              0.1W
PLD          1    0.15W             0.2W











TBD 



Table 14. MCW-1a Power Consumption



5.1.2 I/O 

5.1.2.1 Over-the-Top RACEway++ Interlink

See Appendix A for the over-the-top RACEway++ interlink connector pinout.

5.1.2.2 PCI 32-Bit Modem Connector

See Appendix B for the PCI 32-bit modem connector pinout.

5.1.2.3 Ethernet 10/100BT

See Appendix C for the Ethernet 10/100BT connector pinout.

5.1.2.4 PPC Debugger

See Appendix D for the PPC Debugger connector pinout.
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6 MECHANICAL 



6.1.1 Packaging

The MCW-1 is a double-sided PCB assembly. The board is designed to be used in a custom system. The MCW-1
PCB is TBD thick and has TBD layers.

6.1.2 Physical Constraint

The PCB must comply with the Motorola daughter card form factor.

7 ENVIRONMENTAL 

7.1.1 Temperature & Air Flow

Operating temperature: TBD

Storage temperature: TBD 

7.1.2 Humidity 
TBD 



7.1.3 Operating Altitude 
TBD 



7.1.4 Shock & Vibration 
TBD 



7.1.5 Compliance 
TBD 



7.1.6 Reliability 
TBD 



8 SWITCHES & JUMPERS 

8.1 J22 Jumper

Provisional Hotswap switch interface for the PXB0.

J22 Ref. Des.   Jumper Function
1-2             PXB0_HS_HNDL_SW high
2-3             PXB0_HS_HNDL_SW low

8.2 J11 Jumper

Raceway clock master selection

J11 Ref. Des.   Jumper Function
1-2 (open)      MCW-1A Master
1-2 (shorted)   MCW-1A Slave
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8.3 J10 Jumper 

F1 Raceway XBREQI - XBREQO crossover.

J10 Ref. Des.   Jumper Function
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.4 J4 Jumper

F2 Raceway XBREQI - XBREQO crossover.

J4 Ref. Des.    Jumper Function
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.5 J3 Jumper

F2 Raceway CBL_CLK_O - CBL_CLK_I crossover.

J3 Ref. Des.    Jumper Function
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.6 J9 Jumper

F1 Raceway CBL_CLK_O - CBL_CLK_I crossover.

J9 Ref. Des.    Jumper Function
3-4, 5-6        Straight through
1-2, 7-8        Crossover

8.7 J18 Jumper

Miscellaneous control

J18 Ref. Des.   Jumper Function
1-2             WDM fail disable
3-4             Serial PROM write enable
5-6             FLASH write enable
7-8             PXB0 use configuration PROM
9-10            Unused

8.8 J21 Jumper

Master clock source selector

J21 Ref. Des.   Jumper Function
1-2             F1 cable port master
3-4             F2 cable port master
Both closed     MCW-1A master
Both open       MCW-1A master



9 TESTABILITY 
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9.1 JTAG Test Scan 

The MPC7400, MPC8240, PCI-PCI bridge, PCE133 ASIC, PXB++ ASIC, XBAR++ ASIC, and the Ethernet
controller provide support for the IEEE Standard 1149.1 test port (JTAG). Refer to the individual component
specifications to obtain their JTAG test access port (TAP) descriptions.

The MCW-1a board contains several JTAG scan chains. They provide access to the JTAG test port on the
MPC7400s, MPC8240, L2 caches, XBAR++, PCE133s, Ethernet, PCI-PCI bridge, and the PXB devices. The
scan chains are defined as:

Chain 1 -> MPC7400_1

Chain 2 -> MPC7400_2

Chain 3 -> MPC7400_3

Chain 4 -> MPC7400_4

Chain 5 -> MPC8240

Chain 6 -> RESET_PLD, PCEFIX1_PLD, NODE0_HA_PLD, NODE1_HA_PLD, PCEFIX2_PLD,
NODE2_HA_PLD, NODE3_HA_PLD, 8240_DECODE_PLD, VOTER_SYNC_PLD, 8240_HA_PLD,
PXB_PROM, L2 Cache_1, PCE133_1, L2 Cache_2, PCE133_2, XBAR, L2 Cache_3, PCE133_3, L2
Cache_4, PCE133_4, PXB++, PCI-PCI Bridge, Ethernet

The scan path is accessible via connector J16. The enable for the scan chain buffer is controlled by jumper
J20.

The RACEway++ interlink external connectors will be tested with external loop-back connectors. 

Note: Both the RACEway++ clock (66 MHz) and the PCI clock (33 MHz) must be running to allow the scan path in 
the PXB to function properly. 

10 RACEway++ Over-the-Top Connector Pinout

Table 15. RACEway++ F1 Cable Mode Connector Pinout J-1

Pin   Signal      Pin   Signal
A1    GND         B1    CLK_X_JX1_IO
A2    GND         B2    JX1_CBL_CLK_IO
A3    GND         B3    JX1_XBREQ_I
A4    GND         B4    JX1_XBREQ_O
A5    GND         B5    JX1_XBSTROBIO
A6    GND         B6    JX1_XBRPLYIO
A7    GND         B7    JX1_XBRDCONIO
A8    GND         B8    JX1_XBIO00
A9    GND         B9    JX1_XBIO01
A10   GND         B10   JX1_XBIO02
A11   GND         B11   JX1_XBIO03
A12   GND         B12   JX1_XBIO04
A13   GND         B13   JX1_XBIO05
A14   GND         B14   JX1_XBIO06
A15   GND         B15   JX1_XBIO07
A16   GND         B16   JX1_XBIO08
A17   GND         B17   JX1_XBIO09
A18   GND         B18   JX1_XBIO10
A19   GND         B19   JX1_XBIO11
A20   GND         B20   JX1_XBIO12
A21   GND         B21   JX1_XBIO13
A22   GND         B22   JX1_XBIO14
A23   GND         B23   JX1_XBIO15
A24   GND         B24   JX1_XBIO16
A25   GND         B25   JX1_XBIO17
A26   GND         B26   JX1_XBIO18
A27   GND         B27   JX1_XBIO19
A28   GND         B28   JX1_XBIO20
A29   GND         B29   JX1_XBIO21
A30   GND         B30   JX1_XBIO22
A31   GND         B31   JX1_XBIO23
A32   GND         B32   JX1_XBIO24
A33   GND         B33   JX1_XBIO25
A34   GND         B34   JX1_XBIO26
A35   GND         B35   JX1_XBIO27
A36   GND         B36   JX1_XBIO28
A37   GND         B37   JX1_XBIO29
A38   GND         B38   JX1_XBIO30
A39   JX1_XBPAR   B39   JX1_XBIO31
A40   +3.3V       B40   R_RSTJX



Table 16. RACEway++ F2 Cable Mode Connector Pinout J-2

Pin   Signal      Pin   Signal
A1    GND         B1    CLK_X_JX2_IO
A2    GND         B2    JX2_CBL_CLK_IO
A3    GND         B3    JX2_XBREQ_I
A4    GND         B4    JX2_XBREQ_O
A5    GND         B5    JX2_XBSTROBIO
A6    GND         B6    JX2_XBRPLYIO
A7    GND         B7    JX2_XBRDCONIO
A8    GND         B8    JX2_XBIO00
A9    GND         B9    JX2_XBIO01
A10   GND         B10   JX2_XBIO02
A11   GND         B11   JX2_XBIO03
A12   GND         B12   JX2_XBIO04
A13   GND         B13   JX2_XBIO05
A14   GND         B14   JX2_XBIO06
A15   GND         B15   JX2_XBIO07
A16   GND         B16   JX2_XBIO08
A17   GND         B17   JX2_XBIO09
A18   GND         B18   JX2_XBIO10
A19   GND         B19   JX2_XBIO11
A20   GND         B20   JX2_XBIO12
A21   GND         B21   JX2_XBIO13
A22   GND         B22   JX2_XBIO14
A23   GND         B23   JX2_XBIO15
A24   GND         B24   JX2_XBIO16
A25   GND         B25   JX2_XBIO17
A26   GND         B26   JX2_XBIO18
A27   GND         B27   JX2_XBIO19
A28   GND         B28   JX2_XBIO20
A29   GND         B29   JX2_XBIO21
A30   GND         B30   JX2_XBIO22
A31   GND         B31   JX2_XBIO23
A32   GND         B32   JX2_XBIO24
A33   GND         B33   JX2_XBIO25
A34   GND         B34   JX2_XBIO26
A35   GND         B35   JX2_XBIO27
A36   GND         B36   JX2_XBIO28
A37   GND         B37   JX2_XBIO29
A38   GND         B38   JX2_XBIO30
A39   JX2_XBPAR   B39   JX2_XBIO31
A40   +3.3V       B40   R_RSTJX
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1 1 Modem Board Connector Pinout 

Table 17. Modem Board Connector Pin Assignments 



J29

Pin  Signal         Signal         Pin
1    5V             PMC_AD0        2
3    5V             PMC_AD1        4
5    5V             PMC_AD2        6
7    5V             PMC_AD3        8
9    PCI_RST_0      PMC_AD4        10
11   GND            PMC_AD5        12
13   GND            PMC_AD6        14
15   PMC_IDSEL_1    PMC_AD7        16
17   5V             PMC_AD8        18
19   5V             PMC_AD9        20
21   PMC_TRDY_0     PMC_AD10       22
23   GND            PMC_AD11       24
25   GND            PMC_AD12       26
27   PMC_STOP_0     PMC_AD13       28
29   5V             PMC_AD14       30
31   5V             PMC_AD15       32
33   PMC_PERR_0     PMC_AD16       34
35   GND            PMC_AD17       36
37   GND            PMC_AD18       38
39   PMC_SERR_0     PMC_AD19       40
41   5V             PMC_AD20       42
43   5V             PMC_AD21       44
45   CLK_PMC        PMC_AD22       46
47   GND            PMC_AD23       48
49   GND            PMC_AD24       50
51   PMC_C_BE0      PMC_AD25       52
53   PMC_C_BE1      PMC_AD26       54
55   5V             PMC_AD27       56
57   5V             PMC_AD28       58
59   PMC_C_BE2      PMC_AD29       60
61   PMC_C_BE3      PMC_AD30       62
63   GND            PMC_AD31       64
65   GND            5V             66
67   GND            PMC_FRAME_0    68
69   PMC_INTA_0     GND            70
71   GND            PMC_IRDY_0     72
73   GND            5V             74
75   PMC_GNT_0      PMC_DEVSEL_0   76
77   5V             PMC_LOCK_0     78
79   PMC_REQ_0      PMC_PAR        80
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12 Processor JTAG Connector Pinout 

The JTAG connectors are unique to each processor. Table 18 shows the generic signal names on each connector pin; the
actual names will have each processor's extension appended to the generic signal name.

Table 18. JTAG Jx Connectors Pin Assignments

Jx-  SIGNAL         Jx-  SIGNAL
1    TDO            2    QACKN
3    TDI            4    TRSTN
5    HALTEDN        6    3.3V
7    TCK            8    CKSTOP_INN
9    TMS            10   N.C.
11   SRESETN        12   N.C.
13   HRESETN        14   «key»
15   CKSTOP_OUTN    16   GND
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13 Non-Processor JTAG Connector Pinout 

The non-processor JTAG connector ties together all the remaining JTAG capable devices. Table 19 shows the
signal names on each connector pin. The connector is designed to include only the programmable PLDs and PROM
when the program cable is installed, or the entire chain when the Boundary scan test connector is installed.

Table 19. JTAG J16 Connectors Pin Assignments

J16-  Signal         Description
1     TMS_JTAG       JTAG Test Mode Select
2     TDI_JTAG       JTAG Test Data In
3     TDO_JTAG       Boundary Scan Test Data Out
4     TESTN          Driven low when connector inserted
5     TCK_JTAG       JTAG Test Clock
6     GND            Ground on module
7     PXB_CNF_TDO    TDO from end of PLD chain
8     TDI_ND0        TDI into non-PLD Chain
9     +5V            +5V Power on Module
10    TEST           Driven high when connector inserted



[Diagram: two cable configurations for connector J16]

PLD Program Configuration (PLD, PLD, PROM chain): J16-1 TMS_JTAG, J16-2 TDI_JTAG, J16-5 TCK_JTAG,
J16-7 PXB_CNF_TDO, J16-9 Power, TESTN, GND.

Boundary Scan Test Configuration: J16-1 TMS_JTAG, J16-2 TDI_JTAG, J16-5 TCK_JTAG, J16-7 PXB_CNF_TDO,
J16-8 TDI_ND0, J16-3 TDO_JTAG, J16-9 Power, TESTN, GND.

Figure 8. JTAG CONNECTOR CONFIGURATION OPTIONS
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14 Design Notes 
14.1 MPC7400 and Nitro Bus Signaling Voltage Support 





                1.8V   2.5V   3.3V
MPC7400 V60x    Yes    Yes    Yes
MPC7400 VL2     Yes    Yes    Yes
Nitro V60x      Yes    Yes    No
Nitro VL2       Yes    Yes    No
PCE133 V60x     No     Yes    No
SBSRAM Vi/o     No     Yes    Yes



14.2 Bypass Capacitors Selection 
(Based on App. Note from Micron TN-00-06)

Vcore = 3.3V +/- 0.165V, which is 5%
Vi/o = 2.5V +/- 0.125V, which is 5%

When the SBSRAMs are driving a 21 pF load from 0V to 2.5V with 1 ns edges, the transient current is:
I = (C * dV)/dt = (30 pF * 2.5V)/1 ns = 75 mA per I/O pin.
For 36 I/O pins, 36 * 75 mA = 2.7A in a 1 ns time interval.

The SyncBurst SRAM has a VDD tolerance of 3.3V +/- 0.165V. Considering some droop from the power bus and a switching
time of 1 ns, and allowing a maximum voltage dip (dV) on the SRAM of -0.05V, the choice of bypass capacitor becomes:

C = (I * dt)/dV = (2.7A * 1 ns)/0.05V = 54 nF per SBSRAM.

Choosing 6 x 10 nF allows some margin.

It is better to use reverse ratio capacitors: 0508, 0406, or 0204.

Low ESR is also very important.

Use a temperature stable dielectric such as X7R.

Candidate part: Vishay VJ0402 style, X7R.

14.3 Tantalum Capacitors Selection 

Ultra-low ESR tantalum capacitors T510 are used in the switching power supply, besides several bulk storage capacitors
distributed around the PCB that feed the Vcore and Vi/o planes, to enable quick recharging of the bypass chip capacitors.
The number of bulk-storage tantalum capacitors depends on the power supply response time characteristic.

The MPC7400 can go from nap mode to full-on mode power within two cycles.

Icore = (10W - 2W)/1.8V ≈ 4.5A
dt = 10 µs

C = (I * dt)/dV = (4.5A * 10 µs)/0.05V = 900 µF
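
The bypass and bulk capacitor sizing in sections 14.2 and 14.3 uses the same two relations, I = C * dV/dt and C = I * dt/dV. The short C sketch below reproduces the arithmetic so the numbers can be rechecked if a load, edge rate, or allowed droop changes. The function names are illustrative assumptions; the input values are the ones quoted in the text.

```c
/* I = C * dV / dt: current needed to slew a capacitive load. */
static double slew_current_a(double c_f, double dv_v, double dt_s)
{
    return c_f * dv_v / dt_s;
}

/* C = I * dt / dV: capacitance that limits the droop to dv_v while
 * supplying i_a for dt_s. */
static double hold_up_capacitance_f(double i_a, double dt_s, double dv_v)
{
    return i_a * dt_s / dv_v;
}

/* I = (P_full - P_nap) / V: core current step when leaving nap mode. */
static double load_step_current_a(double p_full_w, double p_nap_w, double v_core_v)
{
    return (p_full_w - p_nap_w) / v_core_v;
}
```

With the values above, slew_current_a(30e-12, 2.5, 1e-9) gives 75 mA per pin, and hold_up_capacitance_f(36 * 0.075, 1e-9, 0.05) gives 54 nF. The core current step evaluates to about 4.44A, which the text rounds up to 4.5A before computing the 900 µF bulk capacitance.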



PCT/US02/08106 
COMPANY CONFIDENTIAL 



MCW-1 a Functional Specification 






Mercury Computer Systems, Inc.

MEMORANDUM

TO: DISTRIBUTION
FROM: Alden Fuchs
ABOUT: Preliminary Framework interface (Memorandum # AF-4)
VERSION: V0.2
DATE: 8 December, 2000
COPIES TO:

MEMO AF-4
Prototype Framework V0.1



1. Introduction

1.1. Transform Object

1.2. Red-Box

2. Transform Object Sample

2.1. Include the following files to define the interface, and variables required

The contents of dx_dma_var.h:

2.2. Initialize the interface

2.3. Receive input

2.3.1. An Example of the receiving of data from input pin 0

2.4. Send Output

2.4.1. An Example of the sending data on output pin 0

3. Transforms for WCDM Simulation:

3.1. handset (one of n):

3.1.1. Input pins:

3.1.2. Output pins:

3.2. Chan (set of one to m objects):

3.2.1. Input pins:

3.2.2. Output pins:

3.3. broadcast (set of one to k objects):

3.3.1. Input pins:

3.3.2. Output pins:

3.4. Rake (one of n):

3.4.1. Input pins:

3.4.2. Output pins:

3.5. MUX (set of one to L objects):

3.5.1. Input pins:

3.5.2. Output pins:

3.6. MUD (one object for now):

3.6.1. Input pins:

3.6.2. Output pins:

3.7. BER (set of one to m objects):

3.7.1. Input pins:

3.7.2. Output pins:



1. Introduction 

This is a very brief description of the prototype framework and how to use it. The purpose of this 
memo is to describe the software interfaces from within a transform object. 





The above figure depicts the software architecture, and the transform object is a part of the 
Application that is managed by the Application framework. 
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1.1. Transform Object 

The transform object is the basic building block and can be, for example, a Turbo-coder or a QAM modulator.

1.2. Red-Box

The red-box collects transform objects into a logical grouping that describes all of the processing
that will be carried out on a single CPU. (Note: for non real-time operation, e.g. simulation,
collections of red-boxes can be on a single CPU.)
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2. Transform Object Sample 

2.1. Include the following files to define the interface and the required variables

#include "mc_error.h"
#include "mcw1.h"
#include "dx_dma.h"
#include "dx_dma_var.h"

2.1.1. The contents of dx_dma_var.h:

int my_logical_ce;
CONFIG_data *ptr_config_base;
CONFIG_data *ptr_cur_config;
CONFIG_data *ptr_tmp_config;

int active_in_ce[(MAX_CE+1) * MAX_CHAN];
int active_in_ch[(MAX_CE+1) * MAX_CHAN];
int active_in_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_in_buf[(MAX_CE+1) * MAX_CHAN];
int active_in_index;

int active_out_ce[(MAX_CE+1) * MAX_CHAN];
int active_out_ch[(MAX_CE+1) * MAX_CHAN];
int active_out_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_out_buf[(MAX_CE+1) * MAX_CHAN];
int active_out_index;

#define dma_send_pin(pin) \
    dma_send( \
        my_logical_ce, \
        active_out_ce[pin], \
        active_out_ch[pin], \
        (char **)&active_out_buf[pin] \
    )

#define dma_rec_pin(pin) \
    dma_rec( \
        active_in_ce[pin], \
        my_logical_ce, \
        active_in_ch[pin], \
        (char **)&active_in_buf[pin] \
    )

2.2. Initialize the interface 

// get config SMB

dma_all_init(
    my_logical_ce,
    active_in_ce,
    active_in_ch,
    active_in_buf_size,
    active_in_buf,
    (int *)&active_in_index,
    active_out_ce,
    active_out_ch,
    active_out_buf_size,
    active_out_buf,
    (int *)&active_out_index,
    (CONFIG_data **)&ptr_config_base
);

ptr_cur_config = &ptr_config_base[my_logical_ce];
#ifdef debug_print
printf("Vir CE %i, module name is %s\n",
       my_logical_ce, ptr_cur_config->module_name);
#endif

ptr_cur_config->state = STATE_RDY; /* all init done now ready */

// wait for rx to be ready
ptr_tmp_config = &ptr_config_base[active_out_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need receiver to be ready
    sched_yield();

// wait for tx to be ready
ptr_tmp_config = &ptr_config_base[active_in_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need transmitter to be ready
    sched_yield();

#ifdef debug_print
printf("\nCE %i, Virtual CE %i, Starting\n", ce_getid(), my_logical_ce);
#endif
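
The two wait loops above spin until the peer's state field reads STATE_RDY, yielding the CPU on each pass. A possible variant, sketched below, bounds the number of yields so that a peer that never comes up is reported instead of hanging the transform. Everything here is an assumption for illustration: config_entry is a stand-in for the memo's CONFIG_data (whose real layout is not shown), the STATE_* values are placeholders, and wait_until_ready is not part of the framework.

```c
#include <sched.h>

/* Placeholder state values; the framework's real constants are not shown. */
enum { STATE_INIT = 0, STATE_RDY = 1 };

/* Stand-in for CONFIG_data: only the field the handshake reads. */
typedef struct {
    volatile int state;
} config_entry;

/* Yield until the peer reports STATE_RDY.
 * Returns 0 when the peer is ready, -1 after max_yields unsuccessful polls.
 * A negative max_yields waits forever, like the original loops. */
static int wait_until_ready(const config_entry *peer, long max_yields)
{
    long polls = 0;

    while (peer->state != STATE_RDY) {
        if (max_yields >= 0 && polls++ >= max_yields)
            return -1;
        sched_yield(); /* let the peer's thread or process run */
    }
    return 0;
}
```

The state field is declared volatile in the sketch because another execution context is expected to change it while this loop polls.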

2.3. Receive input 

Receive input data if required; input pins can be left unused.

2.3.1. An Example of the receiving of data from input pin 0

/* get data from other CE */
rc = dma_rec_pin(0);
ERROR_MCW1(rc);

OR

rc = dma_rec(
    active_in_ce[0],
    my_logical_ce,
    active_in_ch[0],
    (char **)&active_in_buf[0]
);

ERROR_MCW1(rc);

The data is available in the active_in_buf pointer. Note that this always points to the next available
input buffer in the case of multi-buffering. At a later date the size of input chunk and offset will be
provided so that a FIFO like structure can be used.

2.4. Send Output 

Send output data if required; output pins can be left unused.

2.4.1. An Example of the sending data on output pin 0

/* send data to other CE */
rc = dma_send_pin(0);
ERROR_MCW1(rc);

OR

rc = (long)dma_send(
    my_logical_ce,
    active_out_ce[0],
    active_out_ch[0],
    (char **)&active_out_buf[0]
);

ERROR_MCW1(rc);

The data in the active_out_buf pointer will be sent. On return this always points to the next
available output buffer in the case of multi-buffering. At a later date the size of output chunk and
offset will be provided so that a FIFO like structure can be used.
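
The reason the buffer is passed as a char ** is that the call can hand back a different buffer for the next chunk. The sketch below imitates that contract with a two-buffer (ping-pong) stand-in. It is not the framework's dma_send: pin_buffers, mock_dma_send, the buffer count, and the buffer size are hypothetical names and values chosen for the example, and no data is actually transferred.

```c
#include <stddef.h>

#define NUM_BUFS 2   /* hypothetical: ping-pong buffering */
#define BUF_BYTES 64 /* hypothetical buffer size */

typedef struct {
    char storage[NUM_BUFS][BUF_BYTES];
    int current; /* index of the buffer the caller is filling */
} pin_buffers;

/* Stand-in for a send call: "sends" *buf, then repoints *buf at the next
 * free buffer so the caller can start filling it immediately.
 * Returns 0 on success, -1 on a bad argument. */
static long mock_dma_send(pin_buffers *pb, char **buf)
{
    if (pb == NULL || buf == NULL || *buf == NULL)
        return -1;
    /* A real implementation would start the transfer of *buf here. */
    pb->current = (pb->current + 1) % NUM_BUFS;
    *buf = pb->storage[pb->current];
    return 0;
}
```

A caller therefore must not cache the buffer address across sends; it should always write through the pointer variable that was passed to the call.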
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3. Transforms for WCDM Simulation: 

3.1. handset (one of n): 

This object has two input pins and one output pin. It performs the:

1. Generate transport channel

2. MUX and channel coding

3. Generate TX waveform

4. Simulate RX system for Power control etc.

5. Outputs to the chan model

3.1.1. Input pins:

3.1.1.1. power_control pin 0:

Input to this pin is from output pin 0 of the rake block and is the slot power control.

3.1.1.2. next_chunk pin 1:

Input to this pin is from output pin 1 of the BER block and is the command to send the next n symbols for
processing, e.g. 2 symbols, or a slot etc.

3.1.1.3. next_chunk pin 1:

Optional input pin, used to provide external access (i.e. from outside of the Generate traffic channel bits step)
to the raw data input; i.e. if we did a codec, the output of the codec would go into this block.

3.1.2. Output pins: 

3.1.2.1. signal_out pin 0 : 

This pin goes to one input pin of the chan object group. 

3.1.2.2. raw_bits pin 1:

This pin has the raw data bits as encoded into the Data channel so that the BER, BLER
calculations can be done.

3.2. Chan (set of one to m objects):

In this group of objects, each has two to n input pins and one output pin. They collectively
perform the:

1. Channel model for each of the inputs except the carry pin

2. Sums the local signals, and adds the carry input pin

3. Outputs to the front_end object to send same data to all rake inputs
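
The "sum the local signals and add the carry input" step of a chan object can be sketched as follows. This is only an illustration under stated assumptions: samples are treated as real-valued floats, the channel model is reduced to a single gain per input as a placeholder, and chan_sum is a hypothetical name rather than a framework function.

```c
#include <stddef.h>

/* out[i] = carry[i] + sum over k of gains[k] * inputs[k][i]
 * carry     samples from the sum_in pin (output of the previous chan object)
 * inputs    one sample array per signal_in pin
 * gains     placeholder for the per-input channel model */
static void chan_sum(const float *carry,
                     const float *const *inputs,
                     const float *gains,
                     size_t n_inputs,
                     size_t n_samples,
                     float *out)
{
    for (size_t i = 0; i < n_samples; i++) {
        float acc = carry[i];
        for (size_t k = 0; k < n_inputs; k++)
            acc += gains[k] * inputs[k][i];
        out[i] = acc;
    }
}
```

Chaining chan objects through the carry pin lets m objects, each handling a few handsets, produce one composite signal for the receiver side.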

3.2.1. Input pins: 

3.2.1.1. sum_in pin 0:

Input to this pin is from output pin 0 of another channel object. Currently a dummy input is required
on this pin for the process to fire (needs more thought, i.e. a special first chan??).

3.2.1.2. signal_in pin 1 to n:

Input to this pin is from output pin 0 of the handset block. 
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3.2.2. Output pins: 
3.2.2.1. signal_out pin 0:

This pin goes to input pin 0 of the broadcast object.

3.3. front_end (one object):

This object has one input pin and one output pin. It performs the:

1. Adds the multiple antenna, and other Receiver distortions and noise

2. Simulate RX system (AGC, A/D, multiple antennas) etc.

3. Outputs to the broadcast object to send same data to all rake inputs

Multiple antennas should be treated as separate data streams. The rake receiver will process them
independently, until the MRC stage.

3.3.1. Input pins:

3.3.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the last channel object.

3.3.2. Output pins:

3.3.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the broadcast objects.

3.4. broadcast (set of one to k objects): 

This object is required to simulate broadcast, until the simple framework supports this feature, we 
need this object. 

Each object in the group has one input pin and one to n output pins. They collectively perform 
the: 

1. Takes one input and copies it to all of the output pins un-modified 

2. Outputs same data to all rake input 0 pins. 

3.4.1. Input pins: 

3.4.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the front_end object.

3.4.2. Output pins:

3.4.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the rake objects. 
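
The broadcast object's job, copying one input unmodified to every output pin, amounts to a fan-out copy. A minimal sketch is shown below, assuming each output pin is backed by its own byte buffer of at least the input length; broadcast_copy is a hypothetical helper name, not a framework call.

```c
#include <stddef.h>
#include <string.h>

/* Copy the same len input bytes into each of the n_outs output buffers. */
static void broadcast_copy(const char *in, size_t len,
                           char *const *outs, size_t n_outs)
{
    for (size_t k = 0; k < n_outs; k++)
        memcpy(outs[k], in, len);
}
```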

3.5. Rake (one of n): 

This object has one input pin and two output pins. It performs the:

1. AGC, AFC

2. Initial signal acquisition and searcher receiver

3. Multiple finger receivers

4. Channel estimation, MRC etc.

5. Final data channel despreading.

6. Outputs to:

• MUD group of processes

• Soft-decision symbol processing (FEC decoding and demultiplexing (25.212))

3.5.1. Input pins: 

3.5.1.1. signaljn pin 0 : 

This is the data from the broadcast set, and carries the signals of all the handsets, and noise etc. 

3.5.2. Output pins: 

3.5.2.1. power_controI pin 0 : 

This is the slot power control to be sent back to the handset. 

3.5.2.2. signal.out pin 0 : 

This pin goes to one input pin of the MUX object group. 

3.6. MUX (set of one to L objects):

This object is required to gather and package information from the 1 to n rake objects. The inputs
are placed into packets (???) or into arrays (???), To Be Determined (TBD). This object should be
morphed into the best approximation of the packaging to be provided by a targeted modem.

Each object in the group has one to n input pins and one output pin. They collectively perform
the following:

1. Package rake information into simulated modem-sourced data.

2. Output to MUD input 0 pin (for now, until MUD integration, there will be a dummy
placeholder block).

3.6.1. Input pins:

3.6.1.1. signal_in pin 0 to L:

Input to this pin is from output pin 1 of a rake object, or another MUX object's output pin 0.

3.6.2. Output pins:

3.6.2.1. signal_out pin 0:

This pin goes to input pin 0 of the MUD object.

3.7. MUD (one object for now): 

This object is a placeholder until a real MUD is implemented.
MUD has one input pin and one output pin.

1. Passes through data and formats it for the BER block.

2. Outputs to BER input 0 pin.

3.7.1. Input pins:

3.7.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the MUX object.

3.7.2. Output pins:

3.7.2.1. signal_out pin 0:

This pin goes to input pin 0 of the BER object. 

3.8. BER (set of one to m objects): 

This object is required to gather and package information from the 1 to n handset objects and the
MUD. The inputs are placed into packets (???) or into arrays (???), To Be Determined (TBD).
This object should be morphed into the best approximation of the packaging to be required by a
targeted modem. It also compares the raw input data and raw received data. It also does the FEC
detection and correction and computes the block error rate.

Each object in the group has one to n input pins and one to n+1 output pins. They collectively
perform the following:

1. Package rake/MUD information into simulated modem destination data.

2. Perform all of the bit-level processing, interleaving, FEC, etc. This should be in a separate block.

3. BER, BLER etc. BLER should be done via the CRC check, after all symbol decoding is
performed.

4. Output to GUI input 0 pin to display the stats.

5. Output the generate-the-next-slot command to the one to n handsets.

3.8.1. Input pins:

3.8.1.1. signal_in pin 0 to m:

Input to this pin is from output pin 0 (for now, until MUD is integrated) of the MUD object, or
another output pin 0 of a BER object.

3.8.2. Output pins: 

3.8.2.1. stats_out pin 0 : 

This pin goes to input pin 0 of the host object for display of data on the GUI. 

3.8.2.2. next_slot pin 1 (one of n): 

This pin goes to input pin 1 of the handset object to indicate the system is ready for the next slot 
of data. 
From: Jon Greene <greene@mc.com>

To: "Lauginiger, Frank" <fpl@mc.com>, <joates@mc.com>, <afuchs@mc.com>,

<mvinskus@mc.com>

Date: 6/23/00 3:05PM 

Subject: Some MUD analysis 

All: 

Obviously, I've been thinking about MUD a lot. Below is some analysis. 

First, some news. We apparently have 400 Mhz, 2 meg / 266 Mhz L2 Nitros in 
house (samples). Vitaly is presently working to bring them up. This is 
excellent news. Besides the above speed/size properties, Nitros use 
significantly lower power than Max's and allow for varying L2 configuration 
options. Nitro L2's can be configured the normal way (as a cache) or all or 
half (1 meg) as SRAM memory and can be addressed as such directly. For 
example, one can write a buffer into this memory with vmov or, better yet, 
as the output of some computation. I'm not sure if it could be the source or 
target of a RACEway xfer but we should try to find this out. Even if 
configured as a coherent cache, it can be easily locked and unlocked in user 
mode. I think configuring as 2 meg of SRAM may work the best for MUD but we 
should determine this empirically. 

Now, a critical analysis of ops, buffer sizes, bandwidth, access patterns,
algorithm structure and phases of the moon, are all essential to arriving at
a strategy that stands a chance of working. This of course is not easy
because various techniques impact all of the above in unequal ways. Let's
just consider the R1/R1m R-matrix processing on the above Nitro with a
maximum of 100 users. *Without* taking advantage of the diagonal symmetry in
the Corr matrix, which I now believe will be very difficult to do in the
R-matrix ucoded processing loop(s) (we should discuss this), but still
assuming Corr *can* effectively exist as a byte matrix without degrading
accuracy beyond acceptability, a single plane (i.e., a processor's worth) of
the Corr matrix requires 200 * 200 * 32 = 1,280,000 bytes which fits, albeit
uncomfortably, into the L2. At 2 gigabytes/sec (~ 266 * 8), this matrix (if
L2 resident) can theoretically be consumed in 0.64 ms (remember, 1.33 ms. is
our budget). Now, *if* we go with a completely separate X matrix calculation
without stripmining *and* we also store it as byte values, it would require
at most 100 * 100 * 32 = 320,000 bytes. This must be entirely produced and
consumed in the 1.33 ms. time slice. In *theory*, this can be done in 0.32
ms. Finally, the R1_temp output is of size 200 * 200 = 40,000 bytes and can
be produced in .02 ms. So, with the fully separate X matrix approach and no
symmetry in the Corr, we theoretically require ~1,750,000 bytes of buffer
size (I added a little more for stray stuff such as the C vectors and the
phys <=> virt LUTs, etc.) and ~1.0 ms. to produce and consume these buffers.
If we stripmined X, which seems a better way to go, we could hopefully keep
it resident in L1, thereby reducing L2 buffers to ~1,350,000 bytes and 0.7
ms of L2 I/O. The stripmining also allows us the option of keeping the X
strip as shorts rather than bytes.
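
The buffer and bandwidth figures in the paragraph above can be reproduced with a few lines of arithmetic. The sketch below is only a restatement of those numbers; the variable names are illustrative, and counting the X matrix twice (once produced, once consumed) is an assumption made to match the quoted 0.32 ms.

```python
# Figures quoted in the message; names are illustrative, not from any real code base.
L2_BYTES_PER_SEC = 2e9      # ~266 MHz * 8 bytes, rounded to 2 GB/s as in the text
SLOT_MS = 1.33              # processing budget per time slice

corr_bytes = 200 * 200 * 32     # one plane of the Corr byte matrix
x_bytes = 100 * 100 * 32        # separate X matrix stored as bytes
r1_temp_bytes = 200 * 200       # R1_temp output


def l2_ms(nbytes: int, passes: int = 1) -> float:
    """Time in ms to move nbytes through the L2 `passes` times."""
    return 1e3 * passes * nbytes / L2_BYTES_PER_SEC


corr_ms = l2_ms(corr_bytes)             # consumed once
x_ms = l2_ms(x_bytes, passes=2)         # assumed: produced once, then consumed once
r1_temp_ms = l2_ms(r1_temp_bytes)       # produced once

total_bytes = corr_bytes + x_bytes + r1_temp_bytes
total_ms = corr_ms + x_ms + r1_temp_ms
print(total_bytes, round(total_ms, 2))
```

The raw total is 1,640,000 bytes and about 0.98 ms, which the message rounds up to roughly 1,750,000 bytes and 1.0 ms after allowing for the smaller buffers.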

Now let's consider the ops count. For the R1/R1m processing (including the
generation of the X matrix and 2 antennas), I come up with (2 * 6 * 100 *
100 * 16 + 4 * 200 * 200 * 16) * 750 = (1,920,000 + 2,560,000) * 750 = 3.36
GOPS. (BTW, if you were wondering, 750 = 1000/1.33.) The R0 processing has
less GOPS due to the symmetry. I get (1,920,000 + 2,560,000/2) * 750 = 2.40
GOPS. Since the R0 and R1/R1m processing use the same X matrix, we may be
tempted to consider having only the R0 processor compute the X matrix and
ship it to the R1/R1m processor. This looks nice from a GOPS perspective (R0
= 2.40, R1/R1m = 1.92) but I'm not sure it will work very well given the
lockstep nature of the processing pipe. For example, will the R1/R1m
processor simply be idle waiting for the X matrix or will it be completing
the *prior* R1_temp processing while the R0 processor is computing the
current X?
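
The ops counts quoted above follow directly from the stated products. The sketch below simply recomputes them; reading the first product as the X-matrix generation and the second as the R-matrix dot products is an interpretation of the message, and all names are illustrative.

```python
SLOTS_PER_SEC = 750                     # 1000 / 1.33, as noted in the message

x_ops = 2 * 6 * 100 * 100 * 16          # assumed: X-matrix generation for 2 antennas
r1_ops = 4 * 200 * 200 * 16             # assumed: R1/R1m dot products, no symmetry
r0_ops = r1_ops // 2                    # R0 exploits the diagonal symmetry


def gops(ops_per_slot: int) -> float:
    """Convert operations per slot into giga-operations per second."""
    return ops_per_slot * SLOTS_PER_SEC / 1e9


gops_r1_with_x = gops(x_ops + r1_ops)   # R1/R1m processor computing its own X
gops_r0_with_x = gops(x_ops + r0_ops)   # R0 processor computing X
gops_r1_without_x = gops(r1_ops)        # R1/R1m processor if X is shipped in
print(gops_r1_with_x, gops_r0_with_x, gops_r1_without_x)
```

With X computed only on the R0 processor the split is 2.40 versus 1.92 GOPS, which is the balance the message describes before raising the transfer-time objection.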

But the real killer about having R0 ship X to R1/R1m is that the X matrix
(320,000 bytes) will take at least 1.23 ms. over RACE++
(320,000/260,000,000). And let's not forget the 40,000 byte R_temp output
matrix that has to also be shipped out in the same time frame. So I don't
think this OPs balancing approach will work.
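
A quick check of the transfer times makes the objection concrete. The sketch below uses only the byte counts and the 260,000,000 bytes/sec rate quoted in the message; treating the two transfers as serialized over one link is an assumption.

```python
RACEPP_BYTES_PER_SEC = 260_000_000   # rate used in the message
SLOT_MS = 1.33                       # time slice budget

x_transfer_ms = 1e3 * 320_000 / RACEPP_BYTES_PER_SEC       # X matrix, R0 -> R1/R1m
r_temp_transfer_ms = 1e3 * 40_000 / RACEPP_BYTES_PER_SEC   # R_temp output matrix

# Assumed: both transfers share one link and therefore add.
total_transfer_ms = x_transfer_ms + r_temp_transfer_ms
fits_in_slot = total_transfer_ms <= SLOT_MS
print(round(x_transfer_ms, 2), round(total_transfer_ms, 2), fits_in_slot)
```

Under these assumptions the X matrix alone uses about 1.23 ms of the 1.33 ms slice, and adding the R_temp output pushes the total past the budget.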

We therefore appear to require 3.36 GOPS out of R1/R1m and we might just not
even bother with the R0 symmetry since it doesn't buy you very much given
that mpic needs both R0 and R1/R1m as inputs. In other words, have both
R-matrix processors run essentially the same code. (Will this work?)

Now 3.36 GOPS out of one processor is a tall order. We may have to resort to
a more asymmetric division of labor (The R0 processor takes advantage of the
R0 symmetry and also does a portion of R1/R1m). But, I'd like to pursue the
more balanced division until we are absolutely sure it won't work.

In this approach, both the R0 and R1/R1m processors independently produce
and consume X in strips. A variant could instead produce and consume a
single "value" (actually 32 shorts) of X in a single ucode primitive that
does both the complex multiplies and the dot products (the MUDder of all
primitives). The former is certainly the easier approach and might get us
all the way there but the latter, if it can be cleverly coded, may perform
better. In all cases, the ops don't change but at least the L2 gets some
breathing room.

In any event, the so-called dot-product loop, whether it's separate or
includes the complex multiply, still remains a difficult piece of code to
fully optimize if we allow the number of virtual to physical users to vary
as MUD (and Dr. Oates) demands. Using a LUT to acquire the index list and
count of virtual users for a given physical user will tend to throttle the
dot product code due to short vector lengths, funny address calculations,
and "random" load and store patterns. The load isn't so bad since it's two
cache lines no matter where it comes from. We may want to reorder Corr
anyway just to ease the address arithmetic and DST logic. We could also
simply store in the order we produce and leave it to the mpic processor to
reorder (poor guy). As for the short vector count, I think this can be
overcome with a clever primitive that "pauses" as little as possible between
index lists but this will take some careful design.

I think we should try for the "balanced" stripmine approach with essentially 
the same two primitives running in each processor. In the absence of 
dissenting views, I will continue modifying the C code to realize this 
structure. I'm still not sure where the Amp/fac__xx multiply(s)/shift(s) 
belong but for now I'll rid them entirely from the R-matrix functions that 
I'm preparing for ucoding. 

-Jon 
CC: "Kenny, Jamie" <jfk@mc.com>
Report

To: Wireless Communications Group

From: J. H. Oates

Subject: Channel Estimation

Date: October 20, 2000

1. Introduction

In the conventional RAKE receiver, channel amplitude¹ estimation is required for maximal
ratio combining the RAKE fingers. The BER performance is not strongly dependent on the 
accuracy of the channel amplitude estimates. For Multi-User Detection (MUD) the channel 
amplitude estimates are used for signal subtraction, and accuracy of the channel 
amplitude estimates is more critical. In addition, the channel estimation error is larger 
when MUD is used since channel estimation is performed in a higher interference 
environment. This report investigates the accuracy of the conventional channel amplitude 
estimation techniques under elevated multiple access interference. The effect of channel 
amplitude estimation error on MUD efficiency is then assessed. The analysis presented 
here is intended to be a first-look. There are a number of ways to increase the channel 
amplitude estimation accuracy. A few of these are discussed below. 

Section 2 presents a model for the received signal and match-filter outputs. The effect of 
channel estimation error on MUD efficiency is addressed in section 3. In section 4 the 
accuracy of the conventional channel amplitude estimates is assessed. In section 5 
improved single-user methods are presented for channel amplitude estimation. Section 6 
presents a multi-user channel amplitude estimation method. Section 7 addresses the 
effect of uncancelled multipath on the MUD efficiency, which is used in section 8 to 
assess the effect of dropping small amplitudes. It is shown that the overall MUD efficiency 
is improved by dropping small amplitudes. Conclusions are drawn in section 9. 

2. Signal Model and Matched-Filter Outputs

The baseband received signal can be written 



¹ Amplitudes are complex and hence include magnitude and phase.

where t is the integer time sample index, T = N N_c is the data bit duration, N = 256 is the
short-code length, N_c is the number of samples per chip, w[t] is receiver noise, and where
s̃_k[t] is the channel-corrupted signature waveform for virtual user k. For L multipath
components the channel-corrupted signature waveform for virtual user k is modeled as

where a_kp are the complex multipath amplitudes. Notice that a_kp = a_lp if k and l are two
virtual users corresponding to the same physical user. This is due to the fact that the
signal waveforms of all virtual users corresponding to the same physical user pass
through the same channel. For multiple antennas a_kp is a vector. For dual antennas, for
example, primary and diversity,

The waveform s_k[t] is referred to as the signature waveform for the kth virtual user. This
waveform is generated by passing the spreading code sequence c_k[r] through a pulse-
shaping filter g[t]

    s_k[t] = \sum_{r=0}^{N-1} g[t - r N_c] \, c_k[r]    (4)

where N = 256 and g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine
pulse as opposed to a root-raised-cosine pulse, the received signal r[t] represents the
baseband signal after filtering by the matched chip filter. Note that for spreading factors
less than 256 some of the chips c_k[r] are zero.

Combining Equations (1) through (4) gives 

The output of the despreading operation for a single multipath component is the complex 
statistic 
where τ̂_lq is the estimate of τ_lq, and N_l is the (non-zero) length of code c_l[n]. The values
y_lq[m] are complex and are referred to as the pre-MRC matched-filter outputs. For multiple
antennas, r[t], w[t], y_lq[m] and w_lq[m] are column vectors.

The matched-filter output is then

    y_l[m] = \sum_{m'} \sum_{k} \mathrm{Re}\Big\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{H} a_{kp} \, C_{lkqp}[m'] \Big\} \, b_k[m - m'] + \eta_l[m]
           = \sum_{m'} \sum_{k} r_{lk}[m'] \, b_k[m - m'] + \eta_l[m]    (7)

where â_lq is the estimate of a_lq and η_l[m] is the match-filtered receiver noise. The terms
for m' ≠ 0 result from asynchronous users.






3. Effect of Amplitude Estimation Error on MUD Efficiency 

MUD efficiency is defined in terms of the ratio of the intra-cell interference with MUD (I_MUD)
to the intra-cell interference with the Matched Filter (MF), that is, the intra-cell interference
without MUD (I_MF):

    \beta_{MUD} \equiv 1 - \frac{I_{MUD}}{I_{MF}}    (8)

The total interference without MUD is I_MF + J, where J is the inter-cell interference.
Similarly, the total interference with MUD is I_MUD + J. The ratio of inter-cell interference to
intra-cell interference without MUD is denoted f = J/I_MF. The increase in system capacity is
equal to the ratio of the total interference without MUD to the total interference with MUD,
which is (I_MF + J)/(I_MUD + J) = (I_MF + f I_MF)/(I_MUD + f I_MF) = (1 + f)/(1 − β_MUD + f). For f = 0.3 and
β_MUD = 0.7, MUD increases the system capacity by a factor of 1.3/(1 − 0.7 + 0.3) = 2.2.
Hence, if our goal is to double system capacity the MUD efficiency must be approximately
70% or greater.
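
The capacity relation (1 + f)/(1 − β_MUD + f) is easy to explore numerically. The sketch below evaluates it and also inverts it for a target gain; the function names are illustrative, and the inversion is plain algebra on the same expression rather than a formula stated in the report.

```python
def capacity_gain(beta_mud: float, f: float) -> float:
    """Capacity increase (1 + f) / (1 - beta_mud + f) for MUD efficiency beta_mud."""
    return (1.0 + f) / (1.0 - beta_mud + f)


def required_efficiency(gain: float, f: float) -> float:
    """MUD efficiency needed for a target gain, from solving the same relation."""
    return (1.0 + f) * (1.0 - 1.0 / gain)


gain_at_70_percent = capacity_gain(beta_mud=0.7, f=0.3)
efficiency_for_doubling = required_efficiency(gain=2.0, f=0.3)
print(round(gain_at_70_percent, 2), round(efficiency_for_doubling, 2))
```

For f = 0.3 the exact break-even efficiency for a doubling is 0.65, consistent with the report's "approximately 70% or greater".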

In the following we estimate the loss in MUD efficiency, 1 − β_MUD, due to imperfect channel
estimation. For simplicity of presentation we consider approximately synchronous users.

Recall that in a synchronous system the matched-filter outputs can be expressed as

    y_l = r_{ll} b_l + \sum_{k \ne l} r_{lk} b_k + \eta_l    (9)

and that the intra-cell interference is then

    I_{MF} = \sum_{k \ne l} E\{ r_{lk}^2 \}    (10)

The effect of channel amplitude errors is that the estimates of the R-matrix elements (r̂_lk)
are imperfect, which reduces the interference that is cancelled. When MUD is employed
with imperfect R-matrix estimates the detection statistic is

    y_l - \sum_{k \ne l} \hat{r}_{lk} b_k = r_{ll} b_l + \sum_{k \ne l} (r_{lk} - \hat{r}_{lk}) b_k + \eta_l

where for the present case we have assumed that the bit estimates are perfect. With
MUD the intra-cell interference is
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Now from Equation (7), specialized for synchronous users 

L<r=i p=i J 



(12) 



1 r l 

XZEK^ ' C lk q p+ a kp% C lkgpS 
1 9=1 P=l 



9 =1 

.A 

£ L 



Z 7=1 P=l 



Hence the second-order statistics are 
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= ^^E^2-[l + |p| 2 ] 
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4 2 -ix> is-ix .... 

9=1 p=i l'4J 






where we have assumed that the amplitude error is independent of the amplitude and we 
have used 



(15) 



The second expression is discussed below. We refer to E_k as the error amplitude for the
kth virtual user. The residual interference after MUD IC is

    I_{MUD} = \sum_{k=1, k \ne l} E\{ (\hat{r}_{lk} - r_{lk})^2 \}
            = \frac{A^2}{2N_l} \big[ (K-1)\,\alpha E_d^2 + K E_c^2 \big] \cdot 2 \cdot [1 + |\rho|^2]
            \approx \frac{A^2 K}{2N_l} \big[ \alpha + \beta_c^2 \big] E_d^2 \cdot 2 \cdot [1 + |\rho|^2]    (16)

where all data channels have amplitude A. The error amplitude for the control channels is
denoted E_c and the error amplitude for the data channels is denoted E_d. All data channel
amplitudes are determined by scaling the corresponding control channel amplitudes by
1/β_c. Hence E_d = E_c/β_c.

Similarly we can show that

    E\{ r_{lk}^2 \} = \frac{A_l^2}{2N_l} \, A_k^2 \cdot 2 \cdot [1 + |\rho|^2]    (17)

so that the matched-filter interference is

    I_{MF} = \frac{A^2}{2N_l} \big[ (K-1)\,\alpha A^2 + K \beta_c^2 A^2 \big] \cdot 2 \cdot [1 + |\rho|^2]
           \approx \frac{A^2 K}{2N_l} \big[ \alpha + \beta_c^2 \big] A^2 \cdot 2 \cdot [1 + |\rho|^2]    (18)

Finally, the MUD efficiency is

    \beta_{MUD} = 1 - \frac{I_{MUD}}{I_{MF}} = 1 - \left( \frac{E_d}{A} \right)^2    (19)
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4. Conventional Channel Estimation 

The conventional channel amplitude estimate is given by 

M mts , 

= XX%i;^[m^ (20) 



Jfc=l p=] m m m=l iW m*l 

*=1 p=l 



where

    I_{lk}[m'] \equiv \frac{1}{M} \sum_{m=1}^{M} b_l[m] \, b_k[m - m']    (21)

In the above b_l[m] represent the known pilot bits. (The lth virtual user is implicitly a control
channel.) The number M represents the number of pilot bits used to derive the channel
amplitude estimates. The channel amplitude estimate can be rewritten

K r L 

-°<p + iix* ■«* (22) 



It is shown in the appendix that 

**-*wU = 5 avv £ K» i 2 U (23) 

Hence the variance of the estimate is 
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*fl«* i 2 :Mta, ? i 2 }=£ *R„ r}-Ap ^ i 2 R -m (24) 



p=l JM p=i 

p*q k*l 

K v L 



p=l A=l P=I 

The factor p £ simply reflects the fact that the off-diagonal elements are smaller than the 
diagonal elements due to partial correlations p*p between the antenna elements. In the 
Appendix it is also shown that 

e {\ h vp\ \*p-jr 

f 2 1 l (25) 

Now combining Equations (24) and (25) gives for the variance of the channel amplitude 
estimate 



K = 1 4 »» I 2 } K + ii*6 »w I 2 } K +K 



p*q k*l 

where we have used A\* = A 2 /?-. The first term represents the variance due to a user's 
own multipath interference. This term is small compared to the variance arising from the 
total multiple-access interference. For simplicity we incorporate part of this term into the 
second term and drop the remainder. The final term represents thermal noise and other-
cell interference. For now we assume that thermal noise is small. The interference arising
from other cells is assumed to be proportional to the same-cell interference, with a
constant of proportionality f = 0.35. With these assumptions we have
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(27) 



9=1 



Notice that the magnitude of the error E_l is approximately the same for all users. Also, the
lth user is implicitly a control channel, and hence N_l = PG = 256. If the K_v virtual users
are all at the highest spreading factor, then in terms of the K = K_v/2 physical users we
have

    E_c^2 = (1 + f) \, \frac{L}{M \cdot PG} \, \big[ K \beta_c^2 A^2 + K \alpha A^2 \big]    (28)

where E_c is the magnitude of the channel amplitude error for a control channel, β_c is the
relative control channel amplitude, A is the amplitude for the data channels, and where α
is the activity factor for the data channels. Since the channel amplitudes for the data
channels are determined by scaling the amplitude of the corresponding control channel it
is evident that E_d = E_c/β_c. Hence,

    \left( \frac{E_d}{A} \right)^2 = (1 + f) \, \frac{K L}{M \cdot PG} \left[ 1 + \frac{\alpha}{\beta_c^2} \right]    (29)

Given the parameters

    f    = 0.35
    K    = 128
    L    = 4
    M    = 18
    PG   = 256
    α    = 0.4
    β_c  = 0.7333

we get

    \frac{E_d}{A} = \sqrt{ (1 + 0.35) \, \frac{(128)(4)}{(18)(256)} \left[ 1 + \frac{0.40}{(0.7333)^2} \right] } = 0.51    (30)

The number of pilot bits, M, is taken to be 18, which represents 6 bits per slot, with the
amplitudes averaged over 3 slots. The corresponding MUD efficiency is

    \beta_{MUD} = 1 - \left( \frac{E_d}{A} \right)^2 = 1 - (0.51)^2 = 0.74    (31)
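
The numbers in Equations (30) and (31) can be checked with a short calculation. The sketch below assumes the relative error takes the form (1 + f)·K·L/(M·PG)·[1 + α/β_c²], which is the form that reproduces the quoted 0.51; the function name and argument names are illustrative.

```python
from math import sqrt


def conventional_error_sq(f: float, K: int, L: int, M: int, PG: int,
                          alpha: float, beta_c: float) -> float:
    """(E_d/A)^2 when data amplitudes are scaled from control-channel estimates."""
    return (1 + f) * K * L / (M * PG) * (1 + alpha / beta_c ** 2)


err_sq = conventional_error_sq(f=0.35, K=128, L=4, M=18, PG=256,
                               alpha=0.4, beta_c=0.7333)
relative_error = sqrt(err_sq)      # E_d / A
mud_efficiency = 1 - err_sq        # beta_MUD = 1 - (E_d/A)^2
print(round(relative_error, 2), round(mud_efficiency, 2))   # 0.51 0.74
```

Because the error term scales as 1/M, doubling the number of pilot bits halves (E_d/A)² under this model.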






5. Improved Channel Amplitude Estimates 

One method for significantly improving the channel amplitude estimates is to perform a
second estimate directly on the data channels after the initial data channel demodulation.
Performance is improved for two reasons. First, the entire slot can be used for integration.
Hence we have M = 3(10) = 30 bits. Secondly, the error is not scaled by 1/β_c since the
estimate is performed directly on the data channel. For this method we have

    \frac{E_d}{A} = \sqrt{ (1 + 0.35) \, \frac{(128)(4)}{(30)(256)} \left[ (0.7333)^2 + 0.40 \right] } = 0.29    (32)

and the corresponding MUD efficiency is

    \beta_{MUD} = 1 - \left( \frac{E_d}{A} \right)^2 = 1 - (0.29)^2 = 0.92    (33)
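
The improvement quoted in Equations (32) and (33) can be reproduced the same way. The sketch below assumes the form (1 + f)·K·L/(M·PG)·[β_c² + α] with M = 30, which matches the quoted 0.29; the names are illustrative.

```python
from math import sqrt


def direct_estimate_error_sq(f: float, K: int, L: int, M: int, PG: int,
                             alpha: float, beta_c: float) -> float:
    """(E_d/A)^2 when the amplitude is estimated directly on the data channel."""
    return (1 + f) * K * L / (M * PG) * (beta_c ** 2 + alpha)


improved_err_sq = direct_estimate_error_sq(f=0.35, K=128, L=4, M=30, PG=256,
                                           alpha=0.4, beta_c=0.7333)
improved_error = sqrt(improved_err_sq)       # E_d / A
improved_efficiency = 1 - improved_err_sq    # beta_MUD
print(round(improved_error, 2), round(improved_efficiency, 2))   # 0.29 0.92
```

Both effects named in the text appear in the formula: the longer integration raises M from 18 to 30, and dropping the 1/β_c scaling replaces the bracket [1 + α/β_c²] with [β_c² + α].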

Slightly better performance can be achieved by using both data and control channels. 
This method can be performed either on the daughter card or on the modem card since it 
is a single user method. The assumption is that the matched-filter BER is sufficiently 
good. 



6. Multiuser Channel Amplitude Estimation 

Given the conventional channel estimates and the detected user bits it is possible to 
subtract the MAI which corrupts channel estimation. This method of channel estimation is 
referred to as multiuser channel estimation, as opposed to the conventional single-user 
estimation techniques. A simple multiuser channel estimation technique is presented 
below without analysis. Performance should be determined via simulation. 

From Equation (22) the conventional estimate is 

a n=X"*'** +w * (34) 

kp. 

A multiuser estimate is obtained by subtracting the known interference among the channel 
estimates 
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kp*lq 

where the (hopefully) improved multiuser channel estimate is denoted a, q . The first term 

above is the actual channel amplitude. The second term is the residual interference, and 
the last term represents thermal noise and other-cell interference, which is amplified by 
the multiuser interference subtraction. The extent of the amplification needs to be 
determined. 




7. Effect of Uncancelled Multipath Interference 

It is expected that a typical RAKE receiver will be capable of tracking up to approximately 
16 multipath components. Since the computational complexity of symbol-rate MUD is 
quadratic in the number of multipaths L it is unlikely that MUD implementations will be 
able to cancel all multipath interference. The effect of uncancelled multipath is assessed 
below.

Suppose that the RAKE receiver processes U multipath components, but that the MUD 
implementation cancels interference for L < U components. From Equation (13) we have 

q-\ p=\ 

q=l p~\ ^ q=\ p~L+l 

(36) 

*> q=l p=I 

^ 9=1 p=i Z 9=1 p=L+l 

and the variance is then 
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Z ' V i ?=L+l p=l z ' v / q=L+lp=L+l 

t q=L+l p=l AN i q-L+l p=L+l 

l [ qz=\ p=l 9=1 p=L+I 9=Z-+1 p=l q=L±\ p=L+l J 

=M^ ] {iX2X + fa, + ^]t A5±Ail 

=^^- ] te +fa* +/u,+wJ4M?l 



p=L+l / p=l 



(37) 



Note that is the ratio of the uncancelled to cancelled interference for the Ath users. 
Similarly, we have 

E{rfi=?^^WAl + fa, t +PxJ + /J,A t K% 2 } 08) 

Now, neglecting the second order terms ft,//**,* and averaging over the users p x = E{p x j} 
we arrive at 
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***** £*fr.-*> a } 



2N, 




2-t+|p 


f) 


2N, 




2-6+|p 


?} 


IN, 



B liti£jj JM »(o + /3 e a ){4*+2M a } 



Note that β_x is the ratio of the uncancelled to cancelled interference.



(39) 



In order to assess typical values for β_x, multipath models [1][2][3] were used to generate
random profiles. The models are based on data collected in four areas (A, B, C, and D) in
the San Francisco-Oakland bay area. Table 1 below summarizes the key results. The
table shows β_x versus the number of multipath components L.

    β_x       L = 8     L = 6     L = 4     L = 3     L = 2     L = 1
    Area A    0.0019    0.0064    0.0481    0.0961    0.2376    0.5819
    Area B    0.0012    0.0086    0.0404    0.1115    0.1416    0.5749
    Area C    0.0004    0.0054    0.0291    0.0948    0.1649    0.6603
    Area D    0.0039    0.0128    0.0430    0.0629    0.1435    0.4890


Suppose β_x = 0.05 and (E_d/A)² = 0.51² = 0.260. Without taking uncancelled multipath into
account we found β_MUD = 0.74. Taking uncancelled multipath into account we find

    \beta_{MUD} = 1 - \frac{1}{1 + 2\beta_x} \left[ \left( \frac{E_d}{A} \right)^2 + 2\beta_x \right]
                = 1 - \frac{1}{1 + 2(0.05)} \left[ (0.51)^2 + 2(0.05) \right]
                = 0.67    (40)

where a worst-case β_x = 0.05 is used.
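
The effect of uncancelled multipath in Equation (40) can be evaluated for other values of β_x with a one-line function. The sketch below assumes the form 1 − [(E_d/A)² + 2β_x]/(1 + 2β_x), which reproduces the quoted 0.67; the function name is illustrative.

```python
def mud_efficiency_with_multipath(err_sq: float, beta_x: float) -> float:
    """beta_MUD given (E_d/A)^2 and the uncancelled-to-cancelled ratio beta_x."""
    return 1 - (err_sq + 2 * beta_x) / (1 + 2 * beta_x)


without_multipath = mud_efficiency_with_multipath(err_sq=0.51 ** 2, beta_x=0.0)
with_multipath = mud_efficiency_with_multipath(err_sq=0.51 ** 2, beta_x=0.05)
print(round(without_multipath, 2), round(with_multipath, 2))   # 0.74 0.67
```

Setting β_x = 0 recovers the 0.74 of Equation (31), so the function covers both cases.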






8. Improved MUD Efficiency Due to Dropping Small Amplitudes 



If small amplitude multipath components are not included in the cancellation the MUD
efficiency is reduced slightly due to the additional uncancelled multipath interference, but
it is also increased because of the absence of error resulting from the inclusion of these
small noisy estimates. The net effect is a substantial increase in the MUD efficiency. From
Equation (30) we have

    \frac{E_{d1}^2}{A^2} = (1 + f) \, \frac{K}{M \cdot PG} \left[ 1 + \frac{\alpha}{\beta_c^2} \right]
                         = (1 + 0.35) \, \frac{(128)}{(18)(256)} \left[ 1 + \frac{0.40}{(0.7333)^2} \right]
                         = 0.065    (41)

where E_d1² is the error due to a single multipath (i.e., L = 1). From Equation (37) it is
evident that if a particular multipath amplitude satisfies A_kp² < E_d1² then it is advantageous
not to incorporate this amplitude into the cancellation since the error is greater than the
amplitude. Table 2 shows the mean number of paths E{L} which satisfy A_kp² > E_d1² and the
ratio β_x of the uncancelled to cancelled interference if only these multipaths are cancelled.
The MUD efficiency is then calculated using



P. 



■AfUP-1 1 + 2j g 



(42) 



Table 2. Improved MUD efficiency (β_MUD) due to dropping small amplitudes

              E{L}      β_x       β_MUD
    Area A    2.0300    0.0714    0.7638
    Area B    2.4660    0.0691    0.7482
    Area C    2.2970    0.0680    0.7564
    Area D    2.0690    0.0625    0.7748
    Mean      2.2155    0.0678    0.7608




9. Conclusions 

This report represents a first look at channel estimation and the effect of errors on the
MUD efficiency. Only the case where all users are at the highest spreading factor has
been examined. The initial results indicate that if the conventional channel estimates are
used the MUD efficiency drops to 74% due to estimation errors. If the effect of
uncancelled multipath interference is also considered the MUD efficiency drops down to
67%. If small amplitude multipath components are not included in the cancellation the
MUD efficiency is reduced slightly due to the additional uncancelled multipath
interference, but it is also increased because of the absence of error resulting from the
inclusion of these small noisy estimates. The net effect is a substantial increase in the
MUD efficiency, which is increased to 76%. The actual MUD efficiency will, of course, be
less due to other factors which degrade efficiency. If an improved single-user channel
estimation is used the MUD efficiency can be increased to 92%. This improved method
requires knowledge of the pre-MRC matched-filter outputs. It is perhaps possible to
further increase the MUD efficiency by employing multiuser channel estimation. These
techniques also require knowledge of the pre-MRC matched-filter outputs. The above
referenced MUD efficiency numbers are based on 128 users processed by the
basestation. If fewer users are allowed access to the system in order to increase range
the MUD efficiency is unchanged since the total interference and noise remains
unchanged.
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Appendix A 

In order to estimate the variance of the channel amplitude estimate we need the second 
order statistics 

m m' 

= ZXlJ-'rM* VL, ■ E{l lk [m] ■ J,,.[m'j} (A1) 

where we have used 

**W-^Drf^^--* r - V**, (A2) 

which is derived in Appendix B assuming random codes. In order to evaluate E{I_lk²[m']}
we consider two cases: 1) k = l, and 2) k ≠ l. For k = l we have
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E{ll [m']}= -L ■ fr, [«] • fc^m - m'] ■ [n - m']} 



m=l n=l 



whereas for k ≠ l we have



E{ll [m']}= -iyd X I>' [m] ' fc < W ' t m " m 'l m 'll 

M U=l n=l J 



J7t=l JI=1 

J_ 



(A3) 



^XIX 5 ™ (A4) 



Hence, combining Equations (A3) and (A4) we have 

E{lf k [m']}= 5 U .^i-±j+-L ( A5) 
Equation (A1) then becomes 

=> A i«{v4-i) t i} ,A6, 
Now specializing Equation (A6) to the case where k = l

The above expression is further simplified if we assume that users are approximately
synchronous so that N_llqp[0] ≈ N_l, which gives

£ R/J^,=^- (A8) 
Similarly, specializing Equation (A6) to the case where k ≠ l

1 

MN, 



£ KJ 2 L=7^T (A9) 



Appendix B 

In Appendix A we used the approximation 

^[m3C;, y [m']}=-l5 r 5 tt .5 w .5^ nm . (B1) 

under the restriction that lq ≠ kp. We show here that this expression is exactly true for
chip-synchronous users, and that the approximation is reasonably valid for chip-
asynchronous users, particularly when differences in delay lag are greater than about 2
chips. The analysis is based on random user codes.

The user correlations can be explicitly related to the code correlations as follows 

(B2) 

Consider two cases: 1) l ≠ k, and 2) l = k.

Case 1

When l ≠ k the second-order statistics become
where we have used the assumption of random user codes, independent among the
users. Note also that the summation over i is over the range where c_l[i] is non-zero, and
similarly the summation over j is over the range where c_k[j] is non-zero.

Case 2

Now consider case 2 where l = k

4N t N v ff} 



4/v,/v r 



When / * V we have 



5,., 



whereas when / = /' we have 

^»W-C^[T']}=^£« f [T)-« P/ [T , ]-J?{:;[i]-c l [f].c,[/I.4[/]} 



= ^-{l?«W.g / , / tT I ]-25 (( ,.25 if . 

+ i#]'« f/ [T']'24['"]'c;i/]) 

i=i./V J 



(B4) 



-^gM**,w-»V» w (B5) 



(B6a) 



(B6b) 
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!=JJ"f J 

= 5 r4 .g[T]f[T , ]+^-|^g j) [T]g ) .[T']-^ 5 [T].«[T , ]J 



Hence combining Equations (B5) and (B6c) we have 



and combining cases for l ≠ k and l = k we have



(B6c) 



4W"C;.[Tl}=5, v gW-g[Tl+^jp f [T].g 9 [t 1 ]-W,g[T].«[T , ]J (B7) 



(B8) 



iy l ij Nf ij 

= S lk ■ S r ,g[T]. an- 5 *' 8 '-*" 8 [t]. gr 1+ *L*ML2siti. giJ [*•] 

yv ' if 

The above expression can be used to determine the second-order statistics for the 
general case of symbol-asynchronous and chip-asynchronous users with arbitrary 
spreading factors. In what follows we will be interested in approximating the above 
expression so as to get simple but meaningful results. In order to simplify the expressions 
we consider users all at the highest spreading factor, and we assume that certain small 
values are zero. 

To assess the accuracy of channel estimation we need to determine the second order 
statistics 

E{c ikqp [m] • c; Vpl [m']}= E{c ik [r tkqp [m]) ■ C^[T„ w [irf]J 
with lq ≠ kp. The function g[τ]·g[τ′] in Equation (B8) above is small unless both τ and τ′ are
close to zero, and for the chip-synchronous case the function is exactly zero unless
both τ and τ′ are equal to zero. Since for lq ≠ kp the probability that τ_lkqp[m] is close to
zero is small, a good approximation is to assume that these functions are zero. The third
term can be written



E{c !k [t] • c;,[r']} s ^{^Z », w - * f m| 



(B10) 



The double summation in the brackets 



"i if 

is plotted in Figure B1 for N_l = N_k = 256 versus τ − τ′ for (τ + τ′)/2 = 0.



(B11) 




Figure B1. Plot of S_lk[τ,τ′] for N_l = N_k = 256
versus τ − τ′ for (τ + τ′)/2 = 0.

The sharp localization around τ − τ′ = 0 is valid for all values of (τ + τ′)/2, except that for
(τ + τ′)/2 large the peak value drops off due to the partial overlap of the codes. Hence for
delay lag differences τ − τ′ greater than about 2 chips a good approximation is



This approximation then gives 



E{c ik [r] • c;, [r' j}^ ■ 5„. ■ S lk [t,t] 



(B12) 



(B13) 




which implies 

Efc aw lml C; w [m']}=±--8 u . ^ -8 pp . -5^. -S lk [r,x] (B14) 
provided the delay spread is less than a symbol period. Now it can be shown that 

(B15) 

where N lkxw [m'] is the overlap between the user codes. Our final result is then 

^ P W-C;,,,[ m ]} S 1.5,.5 tt ..^. <V 5 w -^7T^ (B16) 
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1. Multi-User Signal Model 

The Rake receiver operation described in the next section is based on a signal model. The
MUD algorithm and implementation are based on the same model, which is
described below.

Figure 1 shows the uplink complex spreading for the Dedicated Physical Data
CHannels (DPDCHs) and the Dedicated Physical Control CHannel (DPCCH). There can
be from 1 to 6 DPDCHs, denoted DPDCH_k, for k from 1 to 6. If there is more than one
DPDCH, then the spreading factor for all DPDCHs must be equal to 4. For a single
DPDCH (DPDCH_1) the spreading factor can vary from 4 to 256. The data bits for channel
DPDCH_1 are spread by channelization code $c_{d,1} = C_{ch,SF,SF/4}$, where SF is the DPDCH
spreading factor. These channelization codes are referred to as Orthogonal Variable
Spreading Factor (OVSF) codes. They are equivalent to Hadamard codes, except for their
ordering. When there are multiple DPDCHs, dedicated channels DPDCH_k, for k from
1 to 6, are spread by channelization codes $c_{d,k} = C_{ch,4,n}$, where the relationship between n
and k is given in Table 1.

Table 1. Relationship between n and k.

    n    k
    1    1, 2
    3    3, 4
    2    5, 6

The data bits for the DPCCH are spread by code $c_c = C_{ch,256,0}$. The spreading factor for the
DPCCH is always equal to 256. The multipliers $\beta_c$ and $\beta_d$ are constants used to select the
relative amplitudes of the control and data channels. At least one of these constants must
be equal to 1 for any given symbol period m.

Figure 1. Uplink complex spreading of DPDCHs and DPCCH

The uplink spreading for any one of the seven Dedicated CHannels (DCHs) above can be 
represented as shown in Figure 2. 



Figure 2. A second representation of the uplink spreading
for any one of the seven Dedicated CHannels (DCHs).



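where the code c[n] is given by

$$c[n] = \begin{cases}
C_{ch,256,0}[n] \cdot jS_{sh}[n], & \text{DPCCH} \\
C_{ch,256,64}[n] \cdot S_{sh}[n], & \text{DPDCH}_1 \\
C_{ch,256,64}[n] \cdot jS_{sh}[n], & \text{DPDCH}_2 \\
C_{ch,256,192}[n] \cdot S_{sh}[n], & \text{DPDCH}_3 \\
C_{ch,256,192}[n] \cdot jS_{sh}[n], & \text{DPDCH}_4 \\
C_{ch,256,128}[n] \cdot S_{sh}[n], & \text{DPDCH}_5 \\
C_{ch,256,128}[n] \cdot jS_{sh}[n], & \text{DPDCH}_6
\end{cases} \qquad (1)$$

and

$$\beta = \begin{cases}
\beta_c, & \text{DPCCH} \\
\beta_d, & \text{DPDCH}_{1-6}
\end{cases} \qquad (2)$$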

For a DCH with a spreading factor less than 256 there are J = 256/SF data bits 
transmitted during a single 256-chip symbol period (i.e. 1/15 ms). From a signal model 
perspective, the J data bits transmitted per symbol period can be viewed as arising from J 
virtual users, each transmitting a single bit per symbol period. The idea is illustrated in 
Figure 3. 



Figure 3. Transforming a single user with bit rate J bits per symbol period
into J virtual users, each with bit rate 1 bit per symbol period

The codes for these virtual users are formed by extracting SF elements at a time out of 
the DCH code sequence to form J new codes. Each of the J codes is of length 256 chips, 
but with only SF non-zero chips. That is, 



$$c_j[n] = \begin{cases} c[n], & j \cdot SF \le n < (j+1) \cdot SF \\ 0, & \text{otherwise} \end{cases} \qquad (3)$$



This code-partitioning concept is illustrated in Figure 4 for the case SF = 64 so that J =
256/SF = 4 codes are derived from the one DCH code.

Figure 4. Code partitioning concept illustrated for the case SF = 64,
whereby J = 256/SF = 4 codes are derived from a single DCH code.
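The partitioning of Equation (3) can be sketched in a few lines of Python. This is a minimal illustration only: the function name `partition_code` is hypothetical and the chip sequence is a toy pattern, not an actual DCH code.

```python
def partition_code(code, sf):
    """Split one 256-chip DCH code into J = len(code) // sf virtual-user codes."""
    j_total = len(code) // sf
    virtual = []
    for j in range(j_total):
        # Keep chips j*sf .. (j+1)*sf - 1 and zero all others, as in Equation (3).
        virtual.append([c if j * sf <= n < (j + 1) * sf else 0
                        for n, c in enumerate(code)])
    return virtual


code = [(-1) ** (n // 2) for n in range(256)]   # toy +/-1 chips, not a real DCH code
virtual_codes = partition_code(code, sf=64)

# Each virtual code has exactly SF non-zero chips, and together they rebuild the code.
nonzero_counts = [sum(1 for c in v if c != 0) for v in virtual_codes]
rebuilt = [sum(chips) for chips in zip(*virtual_codes)]
```

Because the J virtual codes occupy disjoint chip ranges, they are orthogonal to one another by construction.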

The control channel can also be viewed as a virtual user. Hence, for a given physical user
with spreading factor SF there are $1 + 256 N_D/SF$ virtual users, where $N_D$ is the number of
DPDCHs. (Recall that for $N_D > 1$, SF = 4.)

It turns out to be convenient to use a double indexing scheme to identify virtual users. Let
paired indices kj represent the jth virtual user associated with the kth dedicated channel.
Index j ranges over $0 \le j < J_k = 256/SF_k$, where $SF_k$ is the spreading factor for the kth
dedicated channel. For the remainder of this section the spreading factors $SF_k$ are
assumed to be constant. In section 3 the equations are reformulated to allow for
symbol-by-symbol changes in the spreading factor.

The transmitted signal for virtual user kj can be written

m 

where t is the integer time sample index, $T = N N_c$ is the data bit duration, N = 256 is the
short-code length, $N_c$ is the number of samples per chip, $b_{kj}[m]$ are the data bits, and
where $v_{kj}[t]$ is the transmit signature waveform for virtual user kj. This waveform is
generated by passing the spread code sequence $c_{kj}[n]$ through a root-raised-cosine
pulse-shaping filter h[t]

$$v_{kj}[t] = \sum_{p} h[t - pN_c]\, c_{kj}[p] \qquad (5)$$

Note that $\beta_k = \beta_c$ if the kjth virtual user corresponds to a control channel. Otherwise
$\beta_k = \beta_d$.

The total number of virtual users is denoted

$$K_v = \sum_{k=1}^{K_D} J_k \qquad (6)$$

where $K_D$ is the total number of dedicated channels. The baseband received signal after
root-raised-cosine matched-filtering can be written

$$r[t] = \sum_{k=1}^{K_D} \sum_{m} \sum_{j=0}^{J_k-1} \tilde{s}_{kj}[t - mT]\, b_{kj}[m] + w[t] \qquad (7)$$

where w[t] is receiver noise with a raised-cosine power spectral density, and where $\tilde{s}_{kj}[t]$
is the channel-corrupted signature waveform for virtual user kj. For L multipath
components the channel-corrupted signature waveform for virtual user kj is modeled as

$$\tilde{s}_{kj}[t] = \sum_{p=1}^{L} a_{kp}\, s_{kj}[t - \tau_{kp}] \qquad (8)$$

where $a_{kp}$ are the complex multipath amplitudes. The amplitude ratios $\beta_k$ are incorporated
into the amplitudes $a_{kp}$. Notice that if k and l are two dedicated channels corresponding to
the same physical user then, aside from scaling by $\beta_k$ and $\beta_l$, $a_{kp}$ and $a_{lp}$ are equal.
This is due to the fact that the signal waveforms of all virtual users corresponding to the
same physical user pass through the same channel. The waveform $s_{kj}[t]$ is referred to as
the signature waveform for the kjth virtual user. This waveform is generated by passing
the spread code sequence $c_{kj}[n]$ through a raised-cosine pulse-shaping filter g[t]

$$s_{kj}[t] = \sum_{p} g[t - pN_c]\, c_{kj}[p] \qquad (9)$$

Note that for spreading factors less than 256 some of the chips $c_{kj}[p]$ are zero.
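The pulse-shaping step of Equation (9) amounts to placing a copy of the pulse at every chip position and summing. The sketch below assumes a short symmetric toy pulse in place of a real raised-cosine filter, and the function name `signature_waveform` is hypothetical.

```python
def signature_waveform(code, pulse, n_c):
    """Return s[t] = sum_p g[t - p*n_c] * c[p] for an odd-length pulse centred on its middle tap.

    Output index i corresponds to time t = i - len(pulse) // 2.
    """
    out = [0.0] * ((len(code) - 1) * n_c + len(pulse))
    for p, chip in enumerate(code):
        if chip == 0:
            continue  # zero chips (spreading factor below 256) contribute nothing
        for i, g in enumerate(pulse):
            out[p * n_c + i] += g * chip
    return out


pulse = [0.5, 1.0, 0.5]        # toy Nyquist-like pulse, NOT a real raised cosine
code = [1, -1, 0, 1]           # toy chips, including one zero chip
s = signature_waveform(code, pulse, n_c=2)

# Sampling at the chip instants recovers the chips because the toy pulse is zero
# at non-zero multiples of the chip period.
half = len(pulse) // 2
chip_samples = [s[p * 2 + half] for p in range(len(code))]
```

The same routine with the root-raised-cosine taps in place of `pulse` would correspond to Equation (5).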



2. Rake Receiver Operation 

This section describes the operation of a typical Rake receiver. Figure 5 shows a
representation of the received antenna data that is delivered to the Rake receivers of all
users.

Figure 5. Received antenna data delivered to the Rake receivers of all users.

The figure shows the received signals corresponding to users l and k. These signals are
combined in free space so that the receiver gets one composite signal, which we denote
r[t]. The buffer length is assumed to be an integral number of frames so that
delay lag values $T_{lq}$ are approximately constant with each new filling of the buffer. For
each finger of each user there is a delay lag value $T_{lq}$ indicating the start of frame for the
qth multipath of the lth user. Lag values $T_{lq}$ are assumed to be constant over a frame, but
are allowed to change from frame to frame in response to the delay locked loop operation
and in response to new searcher-receiver sweeps where new delay lags are found. The
lower case values $\tau_{lq} = T_{lq} \bmod 256N_c$ denote the symbol-period offset relative to the start
of an internal symbol period reference clock. Notice that the user spreading factors
change on user frame boundaries. Since users are asynchronous it is impossible to have
a MUD processing frame that corresponds to all user frame boundaries. Hence the MUD
processing frame is matched as closely as possible to the user frame boundaries, but does
not necessarily correspond precisely to any user's frame boundary. Consequently there
will be spreading factor changes that occur during a MUD processing frame. Handling
these mid-frame changes is the subject of section 3 below.

The received signal above, which has been match-filtered to the chip pulse, must next be
match-filtered by the user code-sequence filter. Since the spreading factor for the
DPDCHs is not known, the Rake receiver performs an initial 4-chip despreading over all
DPDCHs. The Fast Hadamard Transform (FHT) can be used here to reduce the
number of operations. The detection statistics for the multiple fingers and multiple
antennas are maximal-ratio combined. Since the DPCCH is always spread with a
spreading factor of 256 the DPCCH can be entirely despread during each symbol period.
TFCI bits are extracted each slot from the DPCCH. After an entire frame is processed the
TFCI is decoded and the spreading factor for that frame is determined. After spreading
factor determination the final DPDCH despreading is performed. The resulting detection
statistics are denoted here as $y_{kj}[m]$, the matched-filter output for the kjth virtual user for
the mth symbol period. Since there are $K_v$ codes, there are $K_v$ such detection statistics,
which are collected into a column vector y[m] for the mth symbol period. The
matched-filter output $y_{li}[m]$ for the lith virtual user can be written



where $\hat{a}_{lq}$ is the estimate of $a_{lq}$, $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, and $N_l$ is the (non-zero) length
of codes $c_{li}[n]$ (i.e., the spreading factor for the lth dedicated channel). The intermediate
result $y_{li,q}[m]$ represents the despread signal at the qth lag, and is here referred to as the
pre-MRC matched-filter output. When multiple antennas are employed, r[t], $y_{li,q}[m]$ and $\hat{a}_{lq}$ are
column vectors with one complex element per antenna.

The matched-filter detector estimates the transmitted data bits as $\hat{b}_{li}[m] = \mathrm{sign}\{y_{li}[m]\}$.
Multiuser detection is considered in the next section.
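The despread, combine and decide chain described above can be sketched as follows for a single virtual user and a single antenna. This is an illustration under stated assumptions rather than the receiver's actual implementation: the 1/N normalization of the per-finger output, the 8-chip toy code, the noise-free two-path channel, and the function names `received_signal` and `rake_detect` are all assumed for the example.

```python
def received_signal(bits, code, amps, lags, n_c):
    """Noise-free multipath signal with n_c samples per chip (chip-instant samples only)."""
    t_sym = len(code) * n_c
    r = [0j] * (len(bits) * t_sym + max(lags))
    for a, tau in zip(amps, lags):
        for m, b in enumerate(bits):
            for n, c in enumerate(code):
                r[tau + m * t_sym + n * n_c] += a * b * c
    return r


def rake_detect(r, code, amp_est, lag_est, n_c, n_symbols):
    """Despread each finger, maximal-ratio combine the fingers, then take the sign."""
    t_sym = len(code) * n_c               # samples per symbol period
    decisions = []
    for m in range(n_symbols):
        y = 0.0
        for a_hat, tau_hat in zip(amp_est, lag_est):
            # Pre-MRC matched-filter output for this finger; the code is real here,
            # so its conjugate is omitted. The 1/N scaling is an assumption.
            y_q = sum(r[n * n_c + tau_hat + m * t_sym] * c
                      for n, c in enumerate(code)) / len(code)
            y += (a_hat.conjugate() * y_q).real   # maximal-ratio combining
        decisions.append(1 if y >= 0 else -1)
    return decisions


code = [1, 1, -1, 1, -1, -1, 1, -1]   # toy 8-chip code standing in for 256 chips
bits = [1, -1, -1, 1]
amps = [1 + 0j, 0.5j]                 # two multipath amplitudes
lags = [0, 3]                         # finger delays in samples
r = received_signal(bits, code, amps, lags, n_c=2)
detected = rake_detect(r, code, amps, lags, n_c=2, n_symbols=len(bits))
```

With perfect amplitude and lag estimates and no other users, the decisions reproduce the transmitted bits; the multiuser case is what the MUD equations address.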

3. Multiuser Detection Equations and Asynchronous Processing 

As shown in Figure 5, a MUD processing interval must necessarily be asynchronous with
most users' frame boundaries since the users are asynchronous. Because of this,
spreading factors will change during a MUD processing frame. When the spreading factor
changes during the processing frame the MUD equations are modified. These
modifications are considered in this section.

The modem delivers matched-filter data to the MUD function on a frame-by-frame basis.
Let $N_P[r]$ represent the number of physical users accessing the system during frame r. For
each frame the following data is received for physical users p = 1 to $N_P[r]$ and each
dedicated channel l

• Number of DPDCHs, $N_{D,p}$

• Spreading factor, $SF_l$

• Amplitude ratios $\beta_d$ and $\beta_c$

• Slot format

• Channel amplitude estimates $\hat{a}_{lq}$

• Channel lag estimates $T_{lq}$

• Matched-filter outputs Urn] for all DCHs 

• Code numbers 

• Gap information for compressed mode 

Matched-filter outputs fjm] correspond to the matched-filter outputs yrfm]. If the Ah 
dedicated channel is a DPCCH then matched-filter outputs are only received for the TPC, 
TFCI and FBI bits. The frfm] values are mapped to the yjm] values as described below. 
The mapping accounts for the frame offsets between the various users. The amount of 
matched-filter data received per physical user depends on the DPDCH spreading factor. 

For each dedicated channel a symbol offset $m_l$ is determined according to




(10) 






m. = 



v (256N C ) 



(11) 



where div denotes integer division (i.e. with truncation). The symbol offset represents the 
fact that the users and hence the frame data are asynchronous. The y-data used for 
interference cancellation is derived from the frame data using 



(12) 



Figure 6 shows an example mapping of user data frames to MUD processing frames. To
illustrate concepts the frames are each 16 symbol periods long rather than the actual 150
symbols for WCDMA. The height of the blocks represents the number of virtual users per
physical user. For physical users 1 and 4 the spreading factor changes in going from data
frame 1 to data frame 2. As shown in the figure this results in spreading factor changes
within the MUD processing frame. The MUD function is designed to calculate the
C-matrix once per frame. Hence mid-frame changes to user spreading factors pose a
problem which requires special treatment. It turns out, as will be shown below, that
mid-frame changes to the spreading factor can be accommodated by performing modified
calculations based on the minimum spreading factor over the MUD processing frame.



Figure 6. Mapping of user data frames to MUD processing frames.



First we develop the MUD matrix signal model which allows user spreading factors to 
change on a symbol-by-symbol basis. We then show how we can perform the processing 
based on the minimum user spreading factors over the MUD processing frame. 

Let us reformulate the signal model presented in section 1 so as to allow spreading
factors to change every symbol period. For every DCH k, there are $J_k[m]$ virtual users,
where index m is the symbol period index. The number of virtual users $J_k[m]$ is

$$J_k[m] = \frac{256}{SF_k[m]} \qquad (13)$$

where $SF_k[m]$ is the spreading factor for the kth dedicated channel during the mth symbol
period. The signature waveform for the jth virtual user of $J_k[m]$ total belonging to the kth
DCH over the mth symbol period can be written

$$s_{kjm}[t] = \sum_{p=0}^{N-1} g[t - pN_c]\, c_{kjm}[p] \qquad (14)$$

where the codes and hence the signature waveforms now include the symbol-period index
m to account for symbol-by-symbol spreading factor changes. The channel-corrupted
signature waveform is then

$$\tilde{s}_{kjm}[t] = \sum_{p=1}^{L} a_{kp}\, s_{kjm}[t - \tau_{kp}] \qquad (15)$$

and thus the received signal corresponding to $K_D$ dedicated channels is

$$r[t] = \sum_{k=1}^{K_D} \sum_{m} \sum_{j=0}^{J_k[m]-1} \tilde{s}_{kjm}[t - mT]\, b_{kj}[m] + w[t] \qquad (16)$$

The MUD matrix signal model proceeds from substituting the received signal r[t] from
Equation (16) into Equation (10) for the matched-filter outputs

« k=\ ;=0 

(17) 

where $\eta_{li}[m]$ is the match-filtered receiver noise and $N_l[m] \equiv SF_l[m]$. The terms for $m' \ne 0$
result from asynchronous users.

The delay lags $\tau_{lq}$ for a given DCH l will under most circumstances be grouped within a
range of 4 to 8 µs. Under extreme conditions the delay spread will be as high as 20
µs. In any event, let $\tau_l$ represent the mean delay lag $\tau_{lq}$ over index q. According to
Equation (10) above, the matched-filter detection statistic $y_{li}[0]$ is the result found by
correlating the received signal starting roughly at delay lag $\tau_l$, where $\tau_l$ is approximately in
the range 0 to $256N_c$. If $\tau_l$ moves significantly outside this range an adjustment in the
symbol period alignment will need to be made to restore $\tau_l$ to within the desired
range. More will be said about this below. Along the same lines, the detection statistic
$y_{li}[m]$ is the result found by correlating the received signal starting roughly at delay lag
$\tau_l + mT$.



For efficient MUD processing it is important for the C-matrices to be constant over a 10
ms MUD processing frame. We now describe a method which operates on constant
C-matrices. Handling changes to user spreading factors is relegated to the IC portion of the
MUD processing. Let us define

$$J_k \equiv \max_{m} J_k[m] \qquad (18)$$

where the maximization is over symbol periods m that contribute to the current MUD
processing frame. This includes not only symbol periods that fall within the MUD
processing frame, but in addition a few symbol periods on either side due to
asynchronous users. Note that the minimum spreading factor for the kth DCH is $SF_k =
256/J_k$. Now define the DCH contraction factor for the mth symbol period as

$$C_k[m] = \frac{J_k}{J_k[m]} \qquad (19)$$

The DCH codes for a given symbol period can be expressed as a sum of the DCH codes
corresponding to the minimum spreading factor. For the kth DCH there are at most $J_k$
virtual users corresponding to the minimum spreading factor. Let the codes for these
users be denoted $c_{kj}[r]$, $0 \le j < J_k$. The codes for the mth symbol period, where there
might be fewer virtual users, are denoted $c_{kjm}[r]$, $0 \le j < J_k[m]$, where

$$c_{kjm}[r] = \sum_{j'=j \cdot C_k[m]}^{(j+1) \cdot C_k[m]-1} c_{kj'}[r] \qquad (20)$$
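The contraction factor and the code-summation relation can be checked numerically with a small sketch. It assumes, for illustration only, that the same underlying 256-chip sequence is partitioned at every spreading factor; the function names and the toy chip pattern are hypothetical.

```python
def contraction_factors(j_per_symbol):
    """Return (J_k, [C_k[m]]) where J_k = max_m J_k[m] and C_k[m] = J_k / J_k[m]."""
    j_max = max(j_per_symbol)
    return j_max, [j_max // j_m for j_m in j_per_symbol]


def virtual_codes(code, j_total):
    """Partition a chip sequence into j_total virtual-user codes of equal segment length."""
    seg = len(code) // j_total
    return [[c if j * seg <= n < (j + 1) * seg else 0 for n, c in enumerate(code)]
            for j in range(j_total)]


def combine(min_sf_codes, j, c_m):
    """Code of virtual user j in a symbol with contraction factor c_m: sum of c_m min-SF codes."""
    group = min_sf_codes[j * c_m:(j + 1) * c_m]
    return [sum(chips) for chips in zip(*group)]


code = [(-1) ** (n // 3) for n in range(256)]    # toy +/-1 chips
# Spreading factor 64 (J = 4) for two symbols, then 16 (J = 16) for two symbols.
j_k, c_k = contraction_factors([4, 4, 16, 16])
min_codes = virtual_codes(code, j_k)

# During a symbol with J_k[m] = 4, virtual user 1 is the sum of 4 minimum-SF codes.
summed = combine(min_codes, j=1, c_m=c_k[0])
direct = virtual_codes(code, 4)[1]
```

Here `summed` and `direct` agree, which is the property that lets the matrix elements be computed once per frame at the minimum spreading factor.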



With this result we are now able to represent the MUD signal model in terms of the C- 
matrix and R-matrix elements based on the codes corresponding to the minimum DCH 
spreading factors. The C-matrix in Equation () above becomes 

<Wm.n] ■ ^^ZX*K r - s)N c + On -n)T +f % -T J< Jr] • c kjM [s] 

= 1~, I I ^-II«[(r-^ c + (m-»)r + f, r T,]c;W-c t; W 

' V /I/WJ fWC|tm] j'Wqin] r s 

=T^T X 2 <W»-»] 

M,\m\ ,w- C ,tm) j-=jC,ln] 

C mjqp [m-n] s J-YL8Krs)N e Hm-n)T+t lg -T^M-c^*] 

r s 

12-H 
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where N^min N ( [m]= SF,. Similarly, the R-matrix becomes 

1 1 r i 



' V /L"*J r=/C|[m] /=;C*ln) 9=1 P=l 
NJ (W)C,[mH (;+l)CanH 

= 777^ X X (22) 

'V/L/MJ }'^c,[m) /=;.Q[n] 



r //jy [m - /i] = £ ]T Re{5£ • C,,,^ [m - n}\ 



so that the matched-filter outputs become 

K D J k [n]-\ 
n Jt=l j=0 



n Jk=l ;=0 

*n 7^H-J f (<+l)-C,[/n]-l U+l) C k {n]-l 1 

=XXX -577^ X X %/[^#«w+^m 

" *=i ;=o [^/L'^J /w-Qtm] /WAUJ J 

This last equation can be written 

a'„ JM-l f (/+l)-C,t/nH 0+l>A[«H 1 

3>„I'n] = XX X 777^ X X ^N-"]K'W+^N 

n k=] j=0 [^V ; Lmj /W . Cj[m) j-^j-C t tn] J 
^ (M)-C,(m]-1 

= aTT^ X y/rW+r7/,M 

^/l^J r=/-Cj[/n] 

y/rM=EE X i X W m -«ir*M 

= 1X2. S Wm- n ]b r [«] 

= XX Xw m - *i ■ m*i 



(23) 



(24) 



where we have defined b ¥ //?y = /fy/b;for |C*/nJ / < (j + 1)C k [n]. Equation (24) is based 
entirely in terms of matrix elements corresponding to the minimum spreading factor for the 
MUD processing frame. 
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1 . Introduction 

Multiuser Detection (MUD) is most often thought of as a technique to improve either
capacity or coverage for the uplink. A few reasons why MUD is uplink-focused are

• Downlink MUD must be performed in the handsets, which are limited in processing 
power 

• Each handset is interested in only one signal 

• In the downlink users are separated by orthogonal codes 

However, there is typically a greater demand for capacity in the downlink. If MUD is only
applied in the uplink the imbalance is even greater. While in the downlink users are
separated by orthogonal codes, because of multipath there is still significant intra-cell
interference. Equalization has been suggested as a means of restoring orthogonality;
however, the computationally attractive linear equalization methods tend to amplify the
other-cell interference and noise.

A downlink MUD method with reduced complexity is described in the next section.
The Fast Hadamard Transform (FHT) is used to reduce complexity. The FHT is used in
both the forward (demodulation) and backward (regeneration) directions.



2. The Method 

The method proceeds according to the following steps 

• Receive amplitude and delay information from the searcher receiver

• Start with the largest multipath 

• Multiply the received signal by the conjugate of the scrambling code (512 chips at a 
time) 

• Perform the FHT on the result (for multirate users, this is done in stages) 

• Determine soft data estimates 






• Set user-of-interest data symbols to zero. 

• Do same for all multipaths 

• Proceed till end of slot 

• Estimate amplitudes and gain factors 

• Diversity combine results and make hard decisions 

• Use hard decisions, gain estimates and FHT to reconstruct chip sequence c[n] (with
user of interest nulled)

• Multiply c[n] by $c_{sh}[n]$ to form d[n] (with user of interest nulled)

• Use amplitude estimates, delay lag estimates (from searcher) and raised-cosine pulse 
to construct chip filter 

• Pass d[n] (with user of interest nulled) through chip filter to reconstruct interference 
signal 

• Subtract interference signal from received signal 

• Demodulate with conventional RAKE receiver 
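The descrambling and FHT steps in the list above can be sketched as follows. This is a toy illustration, not the described implementation: it uses an 8-chip block instead of 512, a single noise-free path, natural (Sylvester) Hadamard ordering rather than OVSF ordering, and a made-up unit-modulus scrambling sequence. The function name `fht` is hypothetical.

```python
def fht(x):
    """Fast Hadamard transform in natural ordering; len(x) must be a power of two."""
    y = list(x)
    h = 1
    while h < len(y):
        # One butterfly stage: combine elements that are h apart.
        for start in range(0, len(y), 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y


n_chips = 8
symbols = [2, 0, -1, 0, 0, 3, 0, -1]           # gain times data bit per code channel

# Backward direction (regeneration): the FHT maps code-channel symbols to chips.
chips = fht(symbols)

# Toy unit-modulus scrambling code, then descrambling with its conjugate.
scramble = [[1, 1j, -1, -1j][n % 4] for n in range(n_chips)]
transmitted = [c * s for c, s in zip(chips, scramble)]
descrambled = [t * s.conjugate() for t, s in zip(transmitted, scramble)]

# Forward direction (demodulation): FHT of the descrambled chips, scaled by 1/N.
recovered = [v.real / n_chips for v in fht(descrambled)]
```

Running the transform in both directions with the user of interest set to zero between them is what yields the nulled chip sequence used to rebuild the interference.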



The WCDMA transmitted signal can be represented as

$$s[t] = \sum_{n} g[t - nN_c]\, d[n]$$

$$d[n] = c[n] \cdot c_{sh}[n]$$

$$c[n] = \sum_{k=1}^{K} G_k\, b_k[n \,\mathrm{div}\, N_k] \cdot c_{ch,k}[n]$$

where g[t] is the raised-cosine pulse¹, $N_c$ is the number of samples per chip, and d[n] is
the composite chip sequence from all users. The received signal is then

$$r[t] = \sum_{q=1}^{L} a_q\, s[t - \tau_q] = \sum_{q=1}^{L} a_q \sum_{n} g[t - \tau_q - nN_c]\, d[n]$$

¹ The chip-matched filter is artificially placed in the transmitter for simplicity of presentation.

The received signal advanced to the delay of interest is
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r[iiN c +T p ] = £a,s(nAr c +x p -t q ] 
i 

L 

9=1 m 

The received signal multiplied by the conjugate of the scrambling codes is 

L 

r[nN c + x p ] • <4 [n] = X*, X «I t p "^ g "^N c + n]c jA [m + n] - [n] 

9=1 m 



0 



= a / ,«c[/z] + >v[n] 



c[/z] + w[n] 



0 



This result can now be demultiplexed using the 512 x 512 FHT. Since $512 = 2^9$, the FHT
proceeds in 9 stages. After the first two stages the SF 4 symbols can be extracted.
Similarly, after k stages the SF $2^k$ symbols can be extracted. The amplitudes $a_p$ can be
determined from the embedded pilot symbols, or searcher-receiver estimates can be
used. If embedded pilot symbols are used, the measurement of the pth multipath of
the kth user is of the form

$$M_{pk} = \tilde{a}_p G_k$$

which includes the user gain factor. After measurements are taken for all multipaths and
all users for a given slot, the multipath amplitudes and user gains can be separated by
determining the dominant left and right singular vectors of the rank-1 matrix $M_{pk}$ (aside
from an arbitrary scale factor which can be given to either the amplitudes or the gains). Once
the approximate amplitudes $\tilde{a}_p$ are known the actual amplitudes $a_p$ are determined by
inverting the diagonally dominant system of equations



9*1 
L 



X^ pi a q 
9=] 



0 
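The separation of multipath amplitudes from user gains via the dominant singular vectors of a rank-1 matrix can be sketched with a plain power iteration. The text does not say how the singular vectors are computed, so the power iteration is an assumption, as are the function name `dominant_singular_pair` and the toy amplitude and gain values.

```python
import math


def dominant_singular_pair(M, iters=20):
    """Power iteration for the dominant singular triplet (u, sigma, v) of a complex matrix."""
    rows, cols = len(M), len(M[0])
    v = [1.0 + 0j] * cols
    sigma = 0.0
    for _ in range(iters):
        # u <- M v, normalized
        u = [sum(M[p][k] * v[k] for k in range(cols)) for p in range(rows)]
        norm_u = math.sqrt(sum(abs(x) ** 2 for x in u))
        u = [x / norm_u for x in u]
        # v <- M^H u, normalized; its norm is the singular value
        w = [sum(M[p][k].conjugate() * u[p] for p in range(rows)) for k in range(cols)]
        sigma = math.sqrt(sum(abs(x) ** 2 for x in w))
        v = [x / sigma for x in w]
    return u, sigma, v


amps = [1 + 0.5j, -0.3 + 0.2j, 0.1j]     # toy multipath amplitudes (rows)
gains = [1.0, 2.0, 0.5, 1.5]             # toy user gain factors (columns)
M = [[a * g for g in gains] for a in amps]

u, sigma, v = dominant_singular_pair(M)

# sigma * u * v^H reproduces M; u is proportional to the amplitudes and v to the
# gains, with the overall scale factor free to be assigned to either one.
max_err = max(abs(M[p][k] - sigma * u[p] * v[k].conjugate())
              for p in range(len(amps)) for k in range(len(gains)))
```

With noisy measurements the matrix is only approximately rank 1, and the dominant pair then gives a least-squares fit to it.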



The chip filter h[t] for reconstructing the interference signal is

$$h[t] = \sum_{q=1}^{L} a_q\, g[t - \tau_q]$$

so that the reconstructed signal is

$$\sum_{q=1}^{L} a_q \sum_{n} g[t - \tau_q - nN_c]\, d[n] = \sum_{n} h[t - nN_c]\, d[n]$$

1 Purpose

The purpose of this memo is to document parts of the discussion we have been
having on how the TI 6414 DSP may connect to the Raceway.

2 Glossary

EMIF - A port on the DSP 6000 series peripheral bus which allows the
connection of memory devices.

SDRAM - In the context of this memo, means the main external memory of the
TI DSP - the one which contains the program and data.

3 Overview

So far, a proposed architecture is that we use the second EMIF (External
Memory Inter-Face) of the TI 6414 DSP to connect to a dual ported RAM.
Raceway transfers actually access the RAM, and then additional processing takes
place on the DSP to move the data to the correct place in SDRAM. In fact, if the
dual port RAM is not large enough to buffer an entire Raceway transfer, then there
will have to be a messaging protocol between the two endpoint DSPs wishing to
exchange messages (because the message will have to be fragmented in order to
not exceed the reserved buffer space).

An additional restriction of this design is that as more Raceway endpoints are
added, the size of the dual port RAM needs to be increased, or the maximum
fragment size needs to shrink, such that the RAM is big enough to contain at least
2*N*P buffers of size F (2*F*N*P bytes in total), where F is the size of the
fragment, N is the number of Raceway endpoints with which this DSP can exchange
messages, P is the number of parallel transfers which can be active on any endpoint
at a time, and the constant 2 represents double buffering so that one buffer can be
transferred to/from the Raceway while a second buffer is transferred to the DSP. The
constant becomes 4 if you want to be able to emulate a full duplex connection.
With a 4 node system, this might be 4*8K*4*4 or 512K plus a little extra for
bookkeeping information. This probably means the minimum size is 1M bytes for
the dual port device.
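The sizing arithmetic above can be written out as a short sketch. The function names are hypothetical, and rounding up to the next power of two is an assumption about how memory devices are sized rather than something the memo states.

```python
def dual_port_ram_bytes(fragment_bytes, endpoints, parallel_transfers, full_duplex=False):
    """Buffer space needed: (2 or 4) * F * N * P bytes."""
    # 2 = double buffering; 4 = double buffering in both directions (full duplex).
    buffering = 4 if full_duplex else 2
    return buffering * fragment_bytes * endpoints * parallel_transfers


def next_power_of_two(n):
    """Smallest power of two that is >= n (assumed device size granularity)."""
    size = 1
    while size < n:
        size *= 2
    return size


# The memo's example: 8K fragments, 4 endpoints, 4 parallel transfers, full duplex.
needed = dual_port_ram_bytes(8 * 1024, endpoints=4, parallel_transfers=4, full_duplex=True)
bookkeeping = 4 * 1024                    # illustrative allowance, not from the memo
device_size = next_power_of_two(needed + bookkeeping)
```

Any non-zero bookkeeping allowance pushes the requirement past 512K, which is why the next available device size is 1M bytes.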

4 Problem Identification

There are several characteristics of this architecture which could prove
problematic:

4.1 Requirement For A Fragment/Defragment Protocol

Raceway transfers can currently be very long. This architecture would require
a protocol for breaking transfers down into fragments. If the DSP is sourcing a
transfer greater than the fragment size, then it has to either dedicate itself for the
period of the transfer to programming the DMA engine, or it has to respond to
interrupts as each fragment is transferred. In either case, there is a substantial
performance impact above and beyond the normal performance hit due to
memory bandwidth utilization.

If the DSP is on the receiving end of a Raceway transfer, a similar process has
to take place, except that there must be an interrupt to get the attention of the DSP
(polling would not be sufficient in such a case).

Beyond the performance hit such a protocol would impose on the DSP, there is
a major disadvantage in that only endpoints willing to implement this protocol can
exchange data with the DSP. It is, in effect, defining a de facto standard subset of
Raceway. This is a major interoperability issue (you can no longer plug a board of
DSPs into a fabric and have them work as a standard Raceway Adjunct
Processor).
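To make the per-fragment cost concrete, the following sketch splits a transfer into fragments and reassembles it. It is a hypothetical illustration of fragmentation in general, not a Raceway protocol; the function names and sizes are invented for the example.

```python
def fragment(message, fragment_size):
    """Split a byte string into consecutive fragments of at most fragment_size bytes."""
    return [message[i:i + fragment_size] for i in range(0, len(message), fragment_size)]


def defragment(fragments):
    """Reassemble fragments received in order."""
    return b"".join(fragments)


message = bytes(range(256)) * 78 + bytes(32)     # 20000-byte example transfer
fragments = fragment(message, fragment_size=8 * 1024)

# Each fragment implies one DMA programming step or one interrupt on the DSP.
dsp_interventions = len(fragments)
reassembled = defragment(fragments)
```

A 20000-byte transfer with 8K fragments needs three interventions instead of one, and the count grows linearly with transfer length.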

4.2 Requirement For The DSP To Be Running Code

If the DSP is involved in the Raceway transfers, then the DSP must already be
running in order to perform Raceway transfers. This will require that all nodes on
the Raceway be self booting.

4.3 Lower Transfer Rates

Raceway is less efficient with smaller transfer sizes. If the fragment size is kept
small to minimize dual port RAM requirements, then aggregate Raceway transfer
rates will be lower because of less efficient utilization of the fabric.

4.4 It Is Different

By changing the way Raceway works, we initiate a significant departure from
the way all current Mercury systems work. While there are many other possible
architectures which will perform well, it is inherently risky to change a
fundamental model of how our multiprocessors communicate.

5 An Alternative Architecture

It may be possible to implement a different architecture which addresses some
of these shortcomings.

5.1 Architecture Description



The proposed architecture still has approximately the same hardware as the
existing architecture. The changes are in the way that the Raceway transfers move
between SDRAM and the Raceway.

In the proposed architecture, the FPGA connects to both the buffering device
(dual port RAM or FIFO) and the DSP. The connection to the buffering device
(hereafter FIFO) is used to move Raceway data to/from the FIFO.

The second connection is to the DSP Host Port. Dave currently believes this is
a moderately high performance interconnect - on the order of 75 Mbytes per
second. This interconnect could itself be used to move data to/from the DSP. The
host port can access data in the DSP on-chip memory, as well as any of the
peripheral devices, including the SDRAM. However, 75 Mbytes per second is
pretty slow compared to normal Raceway bandwidth, and we think we can do
better.

The 6414 contains a second EMIF which can be attached to the FIFO (this is
similar to what the current architecture proposal intends). The difference in this
proposed architecture is that rather than have the DSP program the DMA engine
to move data between the FIFO and the DSP/SDRAM, we propose that the FPGA
program the DMA engine directly via the Host Port.

The Host Port is a peripheral like the EMIF and the Serial Ports. The difference
is that the Host Port can master transfers into the DSP datapaths, i.e. it can read
and write any location in the DSP. Because the Host Port can access the DMA
Controller (we think), it can be used to initiate transfers via the DMA engine.

The advantage of this architecture is that Raceway transfers can be initiated
without the cooperation of the DSP. Thus, the DSP does not have to be self
booting. Performance is increased in two ways: the DSP is free to continue to
compute while Raceway transfers take place, and performance on the Raceway is
increased because there is no need to fragment messages.

The internal datapaths of the DSP are flexible enough that we can control
which devices have priority access to memory and datapath. Specifically, we can
choose to give Raceway transfers priority over the CPU, or vice versa.
132 5.2 Synchronization Issues 

133 There is an issue to be solved in how we match data rates between Raceway 

134 and the DSP. The EMEF looks to the DSP as if it were a memory, thus it is 

135 reasonable for the DSP to assume it can get at the data it needs at any time. 

136 However, if we indeed use a FIFO to buffer data, the implication is that there is a 

137 way to hold off the DSP when we are waiting for the Raceway to empty or fill our 

138 FIFO. A possibility is that the buffer device remains a dual port RAM rather than 

139 a FIFO, and the FPGA actually does a fragment/defragment into the RAM, and 

140 then programs the DMA engine to move that fragment into/out-of the DSP. This 

141 starts to look somewhat like the original architecture, except that because the 

142 FPGA performs the frag/defrag, the actual transfers over the Raceway can be 

143 arbitrarily sized (assuming we can throttle the Raceway). 

144 Synchronization remains one of the larger problems to be . solved with this 

145 proposed architecture. 

146 5.3 Sample Transfers 

147 In order to illustrate how this architecture would work, two examples are 

1 48 given. The first example is when the Raceway attempts to read data out of the 

149 DSP memory. 

150 5.3.1 Raceway Reading DSP Memory 

15 1 In this example, we assume that another DSP is trying to read the SDRAM of 

152 the local DSP. 



The FPGA detects a Raceway packet arriving, and decodes that it is a read 
of address 0x10000 (for instance). 

The FPGA writes over the Host Port Interface in order to program the 
DMA engine. It programs the DMA engine to transfer data starting at 
location 0x10000 (a location in the primary EMIF corresponding to a 
location in SDRAM) to a location in the secondary EMIF (the buffer 



153 
154 

155 
156 
157 
158 



1) 

2) 
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159 device/FIFO). As data arrives in the buffer device, the FPGA reads the 

160 data out of the buffer device, and moves it onto the Raceway. When the 

161 proper number of bytes have been moved, the DMA engine finishes the 

162 transfer, and the FPGA finishes moving data from the FIFO to the 

163 Raceway. 

5.3.2 Raceway Writing DSP Memory

In this example, we assume that another DSP is trying to write to the SDRAM of the local DSP.

1) The FPGA detects a Raceway packet arriving, and decodes that it is a write of location 0x20000
(for instance).

2) The FPGA fills some amount of the buffer device with data from the Raceway, and then:

3) The FPGA writes over the Host Port Interface in order to program the DMA engine. It programs
the DMA engine to transfer data from the buffer device (secondary EMIF) and to write it to the
primary EMIF at address 0x20000.

4) At the end of the transfer, we could either interrupt the DSP to signal that a Raceway packet
has arrived, or we can use the standard Mercury method of polling a location in the SMB to see
whether the transfer has completed yet.
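The two sample transfers can be mimicked with a toy model. The sketch below is illustrative only: the class name `ToyNode`, the 256-byte buffer size, the fragment loop and the `done_flag` field are assumptions made for the example and are not part of the proposal. It models the SDRAM on the primary EMIF and the buffer device on the secondary EMIF as byte arrays, and the DMA engine as two copy routines that the FPGA "programs".

```python
class ToyNode:
    """Toy model of one DSP node (all names and sizes are hypothetical)."""

    def __init__(self, sdram_size=0x40000, buffer_size=256):
        self.sdram = bytearray(sdram_size)   # primary EMIF: local SDRAM
        self.buffer_size = buffer_size       # capacity of the buffer device
        self.buffer = bytearray()            # secondary EMIF: buffer device/FIFO
        self.done_flag = False               # stands in for the polled completion location

    def dma_sdram_to_buffer(self, addr, nbytes):
        # DMA engine moves data from the primary EMIF into the buffer device.
        self.buffer = bytearray(self.sdram[addr:addr + nbytes])

    def dma_buffer_to_sdram(self, addr):
        # DMA engine moves data from the buffer device into the primary EMIF.
        self.sdram[addr:addr + len(self.buffer)] = self.buffer
        self.buffer = bytearray()


def raceway_read(node, addr, nbytes):
    """Section 5.3.1: a remote read, one buffer-sized fragment at a time."""
    out = bytearray()
    offset = 0
    while offset < nbytes:
        chunk = min(node.buffer_size, nbytes - offset)
        node.dma_sdram_to_buffer(addr + offset, chunk)  # FPGA programs the DMA via HPI
        out += node.buffer                              # FPGA drains the buffer onto the Raceway
        node.buffer = bytearray()
        offset += chunk
    return bytes(out)


def raceway_write(node, addr, payload):
    """Section 5.3.2: a remote write, one buffer-sized fragment at a time."""
    node.done_flag = False
    offset = 0
    while offset < len(payload):
        chunk = payload[offset:offset + node.buffer_size]
        node.buffer = bytearray(chunk)                  # FPGA fills the buffer from the Raceway
        node.dma_buffer_to_sdram(addr + offset)         # FPGA programs the DMA via HPI
        offset += len(chunk)
    node.done_flag = True                               # completion that the DSP could poll


node = ToyNode()
payload = bytes(range(200)) * 3
raceway_write(node, 0x20000, payload)
readback = raceway_read(node, 0x20000, len(payload))
```

The model ignores timing, throttling and synchronization, which are the open issues the text identifies.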

5.4 Additional Thoughts

1) We need to verify that the Host Port Interface can program the DMA engine. The documentation
on the 6201 clearly states that it can write to any location in internal memory, and to anywhere
on the peripheral bus; however, the DMA engine/controller is the datapath controller for all that,
so it is always possible that there is a special case which does not allow writing of the DMA
engine/controller registers from HPI. The chance of this being so is quite remote, but needs to be
verified.

2) We need to understand the transfer rates and latencies of the HPI. This architecture relies on
fairly low latency access through the HPI, otherwise more buffering space would be required, and
at some point bandwidth begins to be affected.

3) We need to understand the limitations of Raceway with respect to throttling, etc. The best case
would be that Raceway can provide data as fast as the EMIF can take it (so we wouldn't worry about
having data ready when EMIF wanted it), and also for Raceway to be able to be throttled so that it
can take the data at the rate the EMIF can provide it. The more the reality deviates from this
best case scenario, the more extra logic is required in the FPGA until at some point complexity
may prevent the architecture from being viable.

4) What we currently know about the 6414 is actually educated guesses based on documentation of
earlier DSPs. We are making some assumptions about how TI will have enhanced their chip.

5) If/when TI ever puts a RapidIO interface on their DSPs, it will almost certainly look like a
high speed HPI, i.e. it will sit on the peripheral bus, have a separate datapath channel, data
coming in will simply flow to the correct addresses, and outgoing data transfers will happen by
programming the DMA engine to send data to the RapidIO peripheral address. This proposed
architecture looks almost exactly like that, and so probably will not require major changes to use
a RapidIO enhanced DSP.

6) There are probably more thoughts, but this is probably a good start...
6201 Design Options

[Figure: block diagrams of Option 1 (RACEWAY, SDRAM) and Option 2 (RACEWAY, LOCAL SDRAM, RACE SDRAM).]

Option 1 is the original proposal submitted at the DSP meeting Monday. Option 2 was created during the 
meeting. 

The main shortfall in Option 1 is the sharing of the EMIF bus between the 6201 and the Raceway
DMA FPGA. During DMA operations over the Raceway, the 6201 will not have access to the EMIF
interface. Any data or instruction fetches from SDRAM will stall. Given the relatively small size of the
internal SRAM, this will impose a significant penalty on the operation of the 6201. Option 1 also requires
the FPGA to take over SDRAM refresh operation when it takes control of the EMIF bus. This passing back
and forth of the refresh task will not be clean.

Option 2 places a bi-directional transceiver between the 6201's EMIF bus and the Raceway
SDRAM. This allows the 6201 to process data and fetch instructions from its local SDRAM without any
interruption while the DMA FPGA is accessing the Raceway SDRAM. The HPI interface is used by the
6201 to program the DMA engine and by the DMA engine to indicate the DMA complete status to the
FPGA. Option 2 also lends itself to a dual 6201 node per Raceway interface. Decode logic controlling
access to the Raceway SDRAM can be designed in a number/combination of ways:

Total access to both 6201s 

Separate areas for each 6201 

Read but no write to the other 6201's memory space 

A separate common area accessible to both for message passing 

The ability of one 6201 to go through the transceiver to the other's local SDRAM (not recommended)
For a migration story to the 6414, Option 2 is a better sell. Option 3 shows the 6414 design: the transceiver is
stripped off and the Raceway SDRAM is connected to the second EMIF. The design will go to one DSP per
Raceway due to the increase in processing power of the 6414.



[Figure: Option 3 block diagram (RACEWAY, LOCAL SDRAM, RACE SDRAM).]
Jonathan Schonfeld

Date: 23-FEB-2001

From: Nmf

Subject: An Efficient WCDMA Receiver Design based on the FFT

File Ref: mjv-019-efficient_wcdma_receiver.doc



1. Introduction 

Typical processing: 

Signal is sampled at N samples per chip. 
Despread by 

upsampling chipping sequence by interpolating and using the RRC chip pulse matched filter as an 
interpolation filter 

Multiplying digitized receive signal by upsampled and interpolated chip sequence 
Accumulate (integrate) results for an entire DPCCH symbol. 

Repeat at the early lead and late lag sample offset values to calculate delay locked loop variables 
Sweep the code correlator N*256 lags to determine code synchronization and channel response 

Spreading sequence is 256 chips long 

Typical filter is 12 chips long 

typical oversampling rate on the receiver is N=8 

Key calculations 

Interpolation of the spreading code - precomputed and stored 

Correlation process: N*256 CMAC 

Correlation repeated for N*256 + 2 (DLL) times 

Total CMACS: N*256 * (N * 256 + 2) = N^2*65536 + 512 * N

For N = 8, this results in: 4,198,400 CMAC 

1 CMAC = 4 RMUL + 2 RADD = 6 ROP 

Results in 25,190,400 Real operations 

At 15000 Hz symbol rate, need: 378 GOP/s 
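
The operation count above can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name and parameter names are illustrative and do not come from the memo.

```python
def traditional_correlator_load(n_oversample=8, code_len=256, dll_lags=2,
                                rop_per_cmac=6, symbol_rate_hz=15_000):
    """Return (CMAC per symbol, real ops per symbol, real ops per second)."""
    cmac_per_correlation = n_oversample * code_len          # N*256 CMAC
    correlations = n_oversample * code_len + dll_lags       # N*256 lags + 2 (DLL)
    total_cmac = cmac_per_correlation * correlations
    total_rop = total_cmac * rop_per_cmac                   # 1 CMAC = 4 RMUL + 2 RADD
    return total_cmac, total_rop, total_rop * symbol_rate_hz


cmac, rop, rop_per_second = traditional_correlator_load()
print(cmac, rop, rop_per_second / 1e9)   # CMAC, ROP, GOP/s
```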
2. A New Design

Use of FFT to perform efficient circular convolution of spreading code sequence
Results in 

Short code synchronization ( chip sync only, not slot or frame ) 

DPCCH demodulation 

Early and late Delay Locked Loop variables 

Rough channel estimate values for an entire symbol worth of differential delay 
Polyphase signal processing 

Digitize the signal at an Nx oversample rate and filter with the RRC filter and split into N streams at
the 1x rate.

Compute the complex conjugate of the FT of the spreading code sequence at the chip rate - 
precomputed and stored 

Computation: 

Filter data at Nx oversample rate and split into N streams at 1x rate



For each stream, 

Compute 256 point FFT 

Complex multiply FFT with stored FFT values of spreading code 
Inverse 256-point FFT 

Ops calculation: 

Input filter: could be done using FFT as well. 

but for time domain processing: 8*256 points, filter length 96 => 

96 RMUL per point, 95 RADD per point, 

Total of 196,608 RMUL, 194,560 RADD per symbol => 391,168 ROP per symbol
I and Q streams, => 782,336 ROP



Stream processing ( 8 streams ) 

Radix 4 FFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP 
256 CMUL = 1536 ROP 

Radix 4 IFFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP 
TOTAL per stream: 71168 ROP 

Total stream calcs: 569,344 ROP 

Total ops per second at 15000 Hz symbol rate is: 20.3 GOPS 
more than 18 times more efficient than traditional approach. 

Also, the DLL circuitry can be eliminated since the entire channel response is calculated at the 
symbol rate. 

FFT numbers may be off by a factor of 2 larger in the number of complex multiplications needed. 
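
The totals in this section can be checked as follows. This is a sketch under stated assumptions: the per-FFT figure of 34,816 ROP is taken from the memo as given rather than re-derived, and the function and key names are illustrative.

```python
def fft_receiver_load(n_streams=8, fft_len=256, filter_len=96,
                      fft_rop=34_816, rop_per_cmul=6, symbol_rate_hz=15_000):
    """Operation counts per symbol for the FFT-based design."""
    points = n_streams * fft_len                     # 8*256 filtered points per symbol
    rmul = points * filter_len                       # 96 RMUL per point
    radd = points * (filter_len - 1)                 # 95 RADD per point
    filter_rop_iq = 2 * (rmul + radd)                # I and Q streams
    per_stream = fft_rop + fft_len * rop_per_cmul + fft_rop   # FFT, 256 CMUL, IFFT
    stream_rop = n_streams * per_stream
    total = filter_rop_iq + stream_rop
    return {
        "rmul": rmul,
        "radd": radd,
        "filter_rop_iq": filter_rop_iq,
        "per_stream": per_stream,
        "stream_rop": stream_rop,
        "rop_per_symbol": total,
        "rop_per_second": total * symbol_rate_hz,
    }


load = fft_receiver_load()
traditional_rop_per_second = 25_190_400 * 15_000     # figure from Section 1
speedup = traditional_rop_per_second / load["rop_per_second"]
print(load["rop_per_second"] / 1e9, speedup)
```

With the memo's numbers this gives about 20.3 GOPS and a ratio of roughly 18.6 relative to the traditional approach.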
Algorithm for the UMTS FDD Uplink

John H. Oates 
Mercury Computer Systems, Inc. 
Wireless Communications Group 

199 Riverneck Road 
Chelmsford, MA 01824-2820 USA 
Tel: 978-256-0052 x1659

FAX: 978-256-8596 
E-mail: ioates@mc.com 
Technical Area: 03 



Introduction 

Multi-User Detection (MUD) has been shown to provide a number of significant benefits [1][2]. These
include increased system capacity, increased range, enhanced Quality of Service (QoS), improved near-far
resistance, extended battery life, and reduced handset transmit power. This paper describes the practical
implementation of Multi-User Detection (MUD) for the UMTS uplink using short codes. The focus is on
practical implementation details such as efficient implementation of the calculations, processing
requirements, latencies, MUD efficiency, and mapping to hardware.

The use of short codes allows MUD to be performed at the symbol rate. As such MUD can be introduced 
into a conventional Base-Transceiver-Station (BTS) as an enhancement to the Matched-Filter (MF) RAKE 
receiver. The MUD processing takes the MF detection statistics, performs interference cancellation, and 
then delivers improved hard or soft-decision symbol estimates to the symbol-rate BTS processing 
functions. The MUD processing introduces only a few milliseconds latency. Because of the reduced 
computational complexity of MUD operating at the symbol rate the entire MUD functionality can be 
implemented in software on a single card or daughter card populated with a minimal number of processors. 
We present here an implementation of an iterative hard-decision Interference Cancellation (IC) algorithm 
on four Power PC 7410 processors. The processors are connected together with a high-bandwidth RACE++ 
interconnect fabric. 

In order to perform MUD at the symbol rate the correlation between the user channel-corrupted signature 
waveforms must be calculated. These correlations are stored as elements of matrices, here referred to as the 
R-matrices. Since the channel is continually changing these correlations must be updated in real time. 
There are two elements to updating the R-matrices. The first part is based on the user code correlations. 
These depend on the relative lag between the various user multipath components. It is assumed that these 
lags change with a time constant of about 400 ms. The second part is due to the fast variation of the 
Rayleigh-fading multipath amplitudes. It is assumed that these amplitudes are changing with a time 
constant of about 1.33 ms. The R-matrices are used to cancel the multiple access interference through the 
Multi-stage Decision -Feedback Interference Cancellation (MDFIC) technique. 



UMTS Uplink Multi-rate Signal Model and RAKE Processing 

We derive here the equations describing the MF outputs based on the WCDMA transmitted waveform.
The users accessing the system will hereafter be referred to as physical users. Each physical user is
regarded as a composition of virtual users. Each virtual user transmits a single bit per symbol period, where 
by symbol period we mean a time duration of 256 chips (i.e. 1/15 ms). The number of virtual users, then, 
for a given physical user is equal to the number of bits transmitted in a symbol period. At a minimum each 
active physical user is composed of two virtual users, one for the Dedicated Physical Control Channel 
(DPCCH)[3] and one for the Dedicated Physical Data CHannel (DPDCH). If the physical user is a data
user with Spreading Factor (SF) less than 256 then there are J = 256/SF data bits and one control bit
transmitted per symbol period. Hence for the rth physical user with data-channel spreading factor SF_r there
are a total of 1 + 256/SF_r virtual users. The total number of virtual users is denoted

$$K_v = \sum_{r=1}^{K} \left[ 1 + \frac{256}{SF_r} \right] \qquad (1)$$

where K is the number of physical users.



The transmitted waveform for the rth physical user can be written as 



... . (2) 



where t is the integer time sample index, T = N N_c is the data bit duration, N = 256 is the short-code length,
N_c is the number of samples per chip, and where β_k = β_c if the kth virtual user is a control channel and
β_k = β_d if the kth virtual user is a data channel. The multipliers β_c and β_d are constants used to select the
relative amplitudes of the control and data channels. At least one of these constants must be equal to 1 for
any given symbol period m. The waveform s_k[t] is referred to as the transmitted signature waveform for the
kth virtual user. This waveform is generated by passing the spread code sequence c_k[n] through a
root-raised-cosine pulse shaping filter h[t]. If the kth virtual user corresponds to a data user with spreading
factor less than 256 then the code c_k[n] still has length 256, but only N_k of the 256 elements are non-zero,
where N_k is the spreading factor for the kth virtual user. The non-zero values are extracted from the code
C_ch,256,64[n] · S_sh[n]. The W-CDMA standard actually allows for up to six DPDCHs to be multiplexed with a
single DPCCH. This functionality is not presently incorporated in the MUD algorithms described below.

The baseband received signal can be written

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_k[t - mT]\, b_k[m] + w[t], \qquad \tilde{s}_k[t] = \sum_{p=1}^{L} a_{kp}\, s_k[t - \tau_{kp}] \qquad (3)$$

where w[t] is receiver noise, s̃_k[t] is the channel-corrupted signature waveform for virtual user k, L is the
number of multipath components, a_kp are the complex multipath amplitudes, and τ_kp are the corresponding
multipath lags. The amplitude ratios β_k are incorporated into the amplitudes a_kp. Notice that if k and l are
two virtual users corresponding to the same physical user then, aside from scaling by β_k and β_l, a_kp and
a_lp are equal. This is due to the fact that the signal waveforms of all virtual users corresponding to the
same physical user pass through the same channel. The waveform s_k[t] is now the received signature
waveform for the kth virtual user. This waveform is identical to the transmitted signature waveform given in
Equation (2) except that the root-raised-cosine pulse h[t] is replaced with the raised-cosine pulse g[t].

Thus far the received signal has been match-filtered to the chip pulse. It must next be match-filtered by the
user code-sequence filter. The resulting detection statistic is denoted here as y_k, the matched-filter output
for the kth virtual user. Since there are K_v codes, there are K_v such detection statistics, which are collected
into a column vector y[m] for the mth symbol period. The matched-filter output y_l[m] for the lth virtual
user can be written

$$y_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n} r[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^{*}[n] \right\} \qquad (4)$$
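where â_lq is the estimate of a_lq, τ̂_lq is the estimate of τ_lq, N_l is the (non-zero) length of code c_l[n], and
η_l[m] is the match-filtered receiver noise. Substituting r[t] from Equation (3) above gives

$$\begin{aligned}
y_l[m] &= \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n} r[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^{*}[n] \right\} \\
&= \sum_{m'} \sum_{k=1}^{K_v} \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{*} a_{kp} \cdot \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \tau_{kp}] \cdot c_l^{*}[n] \right\} b_k[m - m'] + \eta_l[m] \\
&= \sum_{m'} \sum_{k=1}^{K_v} r_{lk}[m']\, b_k[m - m'] + \eta_l[m]
\end{aligned} \qquad (5)$$

The terms for m' ≠ 0 result from asynchronous users.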



MUD Algorithm and Functions 

A vast number of MUD algorithms have been proposed [1][2]. Many of these are too computationally 
complex to be implemented with current technology. The linear-iterative class of MUD algorithms 
[4][5][6] are the least computationally complex. For this class of algorithms software implementation is
feasible. The hard-decision variants of these algorithms also enjoy a significant performance advantage in 
that they do not tend to amplify other-cell interference. The down side is that performance degrades under 
high input BER. Since channel decoding reduces the BER by orders of magnitude, it is possible to be 
operating with raw channel BERs as high as 10%. A number of methods have been proposed to address this 
issue, including the null-zone detector [4], and partial interference cancellation [4][5][6]. We employ 
partial interference cancellation in conjunction with a new thresholding technique which reduces 
computational complexity. Our method provides excellent performance under high input BER. 

The implementation of MUD at the symbol rate can be divided into two functions. The first function is the 
calculation of the R-matrix elements. The second function is interference cancellation, which relies on 
knowledge of the R-matrix elements. The calculation of these elements and the computational complexity
are described in the following section. Computational complexity is expressed in Giga Operations Per
Second (GOPS). The subsequent section describes the MUD IC function. The method of interference
cancellation employed is Multistage Decision Feedback IC (MDFIC) [2][7].



R-matrix 

From Equation (5) above, the R-matrix calculations can be divided into three separate calculations, each 
with an associated time constant for real-time operation, as follows 
$$\begin{aligned}
r_{lk}[m'] &\equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} a_{lq}^{*} a_{kp}\, C_{lkqp}[m'] \right\} \\
C_{lkqp}[m'] &\equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \tau_{lq} - \tau_{kp}]\, c_l^{*}[n] = \frac{1}{2N_l} \sum_{m} g[mN_c + m'T + \tau_{lq} - \tau_{kp}]\, \Gamma_{lk}[m] \\
\Gamma_{lk}[m] &\equiv \sum_{n} c_k[n - m]\, c_l^{*}[n]
\end{aligned} \qquad (6)$$

where we have omitted the hats indicating parameter estimates. Hence we must calculate the R-matrices,
which depend on the C-matrices (C_lkqp[m']), which depend on the Γ-matrix (Γ_lk[m]). The Γ-matrix has the
slowest time constant. This matrix represents the user code correlations for all values of offset m. For the
case of 100 voice users the total memory requirement is 21 MB based on two bytes (real and imaginary
parts) per element. This matrix is updated only when new codes (new users) are added to the system. Hence
this is essentially a static matrix. The computational requirements are negligible. The most efficient method
of calculation depends on the non-zero length of the codes. For high data-rate users the non-zero length of
the codes is only 4 chips. For these codes a direct convolution is the most efficient method to
calculate the elements. For low data-rate users it is more efficient to calculate the elements using the
FFT to perform the convolutions in the frequency domain.

The C-matrix is calculated from the Γ-matrix. These elements must be calculated whenever a user's delay
lag changes. For now assume that on average each multipath component changes every 400 ms. The length
of the g[] function is 48 samples. Since we are oversampling by 4, there are 12 multiply-accumulations
(real x complex) to be performed per element, or 48 operations per element. When there are 100 low-rate
users on the system (200 virtual users) and a single multipath lag (of 4) changes for one user a total of
(1.5)(2)K_v L N_v elements must be calculated. The factor of 1.5 comes from the 3 C-matrices (m' = -1, 0, 1),
reduced by a factor of 2 due to a conjugate symmetry condition. The factor of 2 results because both rows
and columns must be updated. The factor N_v is the number of virtual users per physical user, which for the
lowest rate users is N_v = 2. In total then this amounts to 230,400 operations per multipath component per
physical user. Assuming 100 physical users with 4 multipath components per user, each changing once per
400 ms gives 230 MOPS.
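
The C-matrix update load follows directly from the factors listed above. The sketch below simply multiplies them out; the function and parameter names are illustrative, and counting each real-by-complex multiply-accumulate as 4 real operations is the reading that matches the stated 48 operations per element.

```python
def c_matrix_load(k_v=200, n_paths=4, n_v=2, macs_per_element=12, ops_per_mac=4,
                  physical_users=100, change_period_ms=400):
    """Return (elements per lag change, ops per lag change, ops per second)."""
    # (1.5)(2) K_v L N_v elements; 1.5 * 2 = 3 keeps the arithmetic in integers.
    elements = 3 * k_v * n_paths * n_v
    ops_per_change = elements * macs_per_element * ops_per_mac
    # Every multipath component of every physical user changes once per period.
    changes_per_second = physical_users * n_paths * 1000 / change_period_ms
    return elements, ops_per_change, ops_per_change * changes_per_second


elements, ops_per_change, ops_per_second = c_matrix_load()
print(elements, ops_per_change, ops_per_second / 1e6)   # elements, ops, MOPS
```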

The R-matrices are calculated from the C-matrices. From Equation (6) above the R-matrix elements are

$$r_{lk}[m'] = \mathrm{Re}\left\{ a_l^{H} \cdot C_{lk}[m'] \cdot a_k \right\} \qquad (7)$$

where a_k are L x 1 vectors, and C_lk[m'] are L x L matrices. The rate at which these calculations must be
performed depends on the velocity of the users. The selected update period is 1.33 ms. If the update rate is too
slow such that the estimated R-matrix values deviate significantly from the actual R-matrix values then
there is a degradation in the MUD efficiency. Figure 1 below shows the degradation in MUD efficiency
versus user velocity for an update period of 1.33 ms, which corresponds to two WCDMA time slots. The plot
indicates that there is high MUD efficiency for users with velocity less than about 100 km/hr. The plot
indicates that the interference corresponding to fast users is not cancelled as effectively as the interference
due to slow users. For a system with a mix of fast and slow users the resulting MUD efficiency is an average
of the MUD efficiency for the various user velocities.
Figure 1. MUD efficiency versus user velocity in km/hr

From Equation (7) the calculation of the R-matrix elements can be expressed in terms of an X-matrix
which represents amplitude-amplitude multiplies

$$\begin{aligned}
r_{lk}[m'] &= \mathrm{Re}\left\{ a_l^{H} \cdot C_{lk}[m'] \cdot a_k \right\} = \mathrm{Re}\left\{ \mathrm{tr}\left[ C_{lk}[m'] \cdot a_k \cdot a_l^{H} \right] \right\} \equiv \mathrm{Re}\left\{ \mathrm{tr}\left[ C_{lk}[m'] \cdot X_{lk} \right] \right\} \\
&= \mathrm{tr}\left[ C_{lk}^{R}[m'] \cdot X_{lk}^{R} \right] - \mathrm{tr}\left[ C_{lk}^{I}[m'] \cdot X_{lk}^{I} \right] \\
X_{lk} &\equiv X_{lk}^{R} + jX_{lk}^{I}, \qquad C_{lk}[m'] \equiv C_{lk}^{R}[m'] + jC_{lk}^{I}[m']
\end{aligned} \qquad (8)$$

The advantage of this approach is that the X-matrix multiplies can be reused for all virtual users associated
with a physical user and for all m' (i.e. m' = 0, 1). Hence these calculations are negligible when amortized.
The remaining calculations can be expressed as a single real dot product of length 2L^2 = 32. The
calculations are performed in 16-bit fixed-point math. The total number of operations is thus
1.5(4)(K_v L)^2 = 3.84 Mops. The processing requirement is then 2.90 GOPS. The X-matrix multiplies when
amortized amount to an additional 0.7 GOPS. The total processing requirement is then 3.60 GOPS.
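
The X-matrix rearrangement can be checked numerically. The sketch below is an illustration, not the fixed-point implementation described in the text: it uses floating-point complex numbers and random test data, and all function names are hypothetical. It compares Re{a_l^H C_lk a_k} computed directly with the same quantity computed as a real dot product of length 2L^2 against a precomputed X_lk = a_k a_l^H, and it also reproduces the 3.84 Mops count.

```python
import random


def r_direct(a_l, c_lk, a_k):
    """Re{ a_l^H * C_lk * a_k } using complex arithmetic."""
    total = 0j
    for q, a_lq in enumerate(a_l):
        for p, a_kp in enumerate(a_k):
            total += a_lq.conjugate() * c_lk[q][p] * a_kp
    return total.real


def x_matrix(a_l, a_k):
    """X_lk = a_k * a_l^H, i.e. X[p][q] = a_kp * conj(a_lq)."""
    return [[a_kp * a_lq.conjugate() for a_lq in a_l] for a_kp in a_k]


def r_from_x(c_lk, x_lk):
    """tr[C^R X^R] - tr[C^I X^I], a real dot product of length 2*L*L."""
    size = len(c_lk)
    acc = 0.0
    for q in range(size):
        for p in range(size):
            acc += c_lk[q][p].real * x_lk[p][q].real
            acc -= c_lk[q][p].imag * x_lk[p][q].imag
    return acc


def r_matrix_ops(k_v=200, n_paths=4):
    """1.5 * 4 * (K_v * L)^2 operations per R-matrix update."""
    return 6 * (k_v * n_paths) ** 2


rng = random.Random(1)
L = 4


def rand_c():
    return complex(rng.uniform(-1, 1), rng.uniform(-1, 1))


a_l = [rand_c() for _ in range(L)]
a_k = [rand_c() for _ in range(L)]
c_lk = [[rand_c() for _ in range(L)] for _ in range(L)]

direct = r_direct(a_l, c_lk, a_k)
via_x = r_from_x(c_lk, x_matrix(a_l, a_k))
gops = r_matrix_ops() / 1.33e-3 / 1e9   # one update every 1.33 ms
print(direct, via_x, r_matrix_ops(), gops)
```

Dividing 3.84 Mops by 1.33 ms gives roughly 2.9 GOPS, in line with the figure quoted above.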

MDFIC 

From Equation (5) above the matched-filter outputs are given by 

$$y_l[m] = r_{ll}[0]\, b_l[m] + \sum_{k} r_{lk}[-1]\, b_k[m+1] + \sum_{k} \left[ r_{lk}[0] - r_{ll}[0]\, \delta_{lk} \right] b_k[m] + \sum_{k} r_{lk}[1]\, b_k[m-1] + \eta_l[m] \qquad (9)$$

The first term represents the signal of interest. All the remaining terms represent Multiple Access
Interference (MAI) and noise. The MDFIC algorithm iteratively solves for the symbol estimates b̂_l[m]
using

$$\hat{b}_l[m] = \mathrm{sign}\left\{ y_l[m] - \sum_{k} r_{lk}[-1]\, \hat{b}_k[m+1] - \sum_{k} \left[ r_{lk}[0] - r_{ll}[0]\, \delta_{lk} \right] \hat{b}_k[m] - \sum_{k} r_{lk}[1]\, \hat{b}_k[m-1] \right\} \qquad (10)$$

with initial estimates given by hard decisions on the matched-filter detection statistics, b̂_l[m] = sign{y_l[m]}.
The MDFIC [7] technique is closely related to the SIC and PIC techniques. Notice that new estimates b̂_l[m]
are immediately introduced back into the interference cancellation as they are calculated. Hence at any
given cancellation step the best available symbol estimates are used. This idea is analogous to the
Gauss-Seidel method for solving diagonally dominant linear systems.

The above iteration is performed on a block of 20 symbols, for all users. The 20-symbol block size 
represents two WCDMA time slots. The R-matrices are assumed to be constant over this period. 
Performance is improved under high input BER if the sign detector in Equation (10) is replaced by the 
hyperbolic tangent detector [6]. This detector has a single slope parameter which is variable from iteration 
to iteration. 

The three R-matrices (R[-1], R[0] and R[1]) are each K_v x K_v in size. The total number of operations then is
6K_v^2 per iteration. The computational complexity of the MDFIC algorithm depends on the total number of
virtual users, which depends on the mix of users at the various spreading factors. For K_v = 200 users (e.g.
100 low-rate users) this amounts to 240,000 operations. In the current implementation two iterations are
used, requiring a total of 480,000 operations. For real-time operation these operations must be performed in
1/15 ms. The total processing requirement is then 7.2 GOPS. Computational complexity is markedly
reduced if a threshold parameter is set such that IC is performed only for values |y_l[m]| below the threshold.
The idea is that if |y_l[m]| is large there is little doubt as to the sign of b_l[m], and IC need not be performed.
The value of the threshold parameter is variable from stage to stage.
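
The iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AltiVec implementation: the function names and data layout are hypothetical, symbols outside the block are treated as zero, and the thresholding and hyperbolic-tangent variants are omitted. New decisions are written back immediately, in the Gauss-Seidel manner the text describes.

```python
def sign(x):
    return 1 if x >= 0 else -1


def mdfic(y, r, iterations=2):
    """Hard-decision multistage decision-feedback interference cancellation.

    y[m][l]    : matched-filter output for symbol m, virtual user l
    r[d][l][k] : R-matrix element r_lk[d] for d in (-1, 0, 1)
    Returns b[m][l] in {-1, +1}.
    """
    n_sym, n_users = len(y), len(y[0])
    b = [[sign(v) for v in row] for row in y]        # initial hard decisions
    for _ in range(iterations):
        for m in range(n_sym):
            for l in range(n_users):
                stat = y[m][l]
                for k in range(n_users):
                    if m + 1 < n_sym:
                        stat -= r[-1][l][k] * b[m + 1][k]
                    if k != l:                        # keep the signal-of-interest term
                        stat -= r[0][l][k] * b[m][k]
                    if m >= 1:
                        stat -= r[1][l][k] * b[m - 1][k]
                b[m][l] = sign(stat)                  # reused immediately by later users
    return b


def mdfic_gops(k_v=200, iterations=2, symbol_rate_hz=15_000):
    """6 * K_v^2 operations per iteration, performed once per symbol period."""
    return 6 * k_v ** 2 * iterations * symbol_rate_hz / 1e9


# Two synchronous users; the strong user masks the weak one at the matched filter.
zeros = [[0.0, 0.0], [0.0, 0.0]]
r = {-1: zeros, 0: [[4.0, 1.5], [1.5, 1.0]], 1: zeros}
true_bits = [1, -1]
y = [[4.0 * 1 + 1.5 * -1, 1.5 * 1 + 1.0 * -1]]       # noiseless y = R[0] b
mf_decisions = mdfic(y, r, iterations=0)
ic_decisions = mdfic(y, r, iterations=2)
print(mf_decisions, ic_decisions, mdfic_gops())
```

In this toy case the matched filter alone decides the weak user's bit incorrectly, and the cancellation pass recovers it.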

Mapping to Hardware 

The above calculations are performed on a single 9"x6" card populated with four Power PC 7410
processors. These processors employ the AltiVec SIMD vector arithmetic-logic unit, which has 32 128-bit
vector registers. These registers can hold either 4 32-bit floats, 4 32-bit ints, 8 16-bit shorts, or 16 8-bit
chars. Two vector SIMD operations (multiply and accumulate) can be performed per clock. The clock rate
used for the current implementation is 400 MHz. The processors, however, can be operated at 500 MHz
with higher clock speeds expected in the near future. Each processor has 32KB of L1 cache and 2MB of
266MHz L2 cache. The maximum theoretical performance of these processors is thus 3.2 GFLOPS, 6.4 GOPS
(16-bit), or 12.8 GOPS (8-bit). The current implementation used a combination of floating-point, 16-bit
fixed-point and 8-bit fixed-point calculations.

The four PPC7410 processors are interconnected with a RACE++ 266MB/s 8-port switched fabric as 
shown in Figure 2. The high bandwidth fabric allows transfer of large amounts of data with very low 
latency so as to achieve efficient parallelism of the four processors. The maximum theoretical performance 
of the card is thus 51.2 GOPS.
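
The peak figures quoted above follow from the vector width, the two operations per clock and the 400 MHz clock rate. A minimal sketch of the arithmetic (the function name is illustrative):

```python
def peak_gops(element_bits, vector_bits=128, ops_per_clock=2, clock_mhz=400):
    """Peak operations per second (in units of 1e9) for one processor."""
    lanes = vector_bits // element_bits      # elements held in one vector register
    return lanes * ops_per_clock * clock_mhz / 1000.0


per_processor = {bits: peak_gops(bits) for bits in (32, 16, 8)}
card_peak = 4 * per_processor[8]             # four processors, 8-bit operations
print(per_processor, card_peak)
```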
Figure 2. Partitioning of MUD functions across four processors

As shown in Figure 2 the MDFIC and C-matrix calculations are allocated to a single processor. The other 
three processors are given to the R-matrix calculations which are considerably more complex. 

MUD BER Performance 

A sample of the Bit Error Rate (BER) performance of the MUD algorithm is shown in Figure 3. For 
comparison the matched-filter BER is also shown. The figure shows that MUD doubles system capacity. 
Figure 3. Log10 bit error rate versus system capacity for matched
filter (blue) and multiuser detection (red)



The above performance is based on the following assumptions:

• A single receive antenna is used 

• The target BER is 0.001 
• The percentage of system users in handoff is 30%

• Other-cell interference is 35% of intra-cell interference. This is lower than the typical value (0.60)
used. The reason is that the other-cell users in handoff with the cell of interest are included in the
intra-cell interference. This is because the cell of interest is processing these users and hence can cancel
their interference using MUD.

• A 4-tap multipath channel is used. Each tap is Rayleigh fading. The composite power of all paths is 
perfectly power controlled. 

• The channel amplitude estimation error is 10% 

• The channel delay estimation is Va chip 

• The activity factor for voice is 0.40 

• The relative amplitude of the control channel is β_c = 0.5333



Conclusions 

The current state of processor technology is such that iterative hard-decision MUD for the UMTS uplink 
can be implemented in software on a single card or daughter card populated with four Power PC 7410 
processors, connected together with a high-bandwidth RACE++ interconnect fabric. The use of short codes
allows MUD to be performed at the symbol rate. The advantage of symbol-rate processing is that MUD can 
be introduced into a BTS as an enhancement to the conventional RAKE receiver. The MUD processing 
takes the MF detection statistics, performs interference cancellation, and then delivers improved hard or 
soft-decision symbol estimates to the symbol-rate BTS processing functions. The latency introduced is only 
a few milliseconds. In order to perform MUD at the symbol rate the R-matrices must be updated in real 
time. There is a minimal degradation in MUD efficiency if these elements are updated at a rate of once per 
1.33 ms. The R-matrices are used to cancel the multiple access interference through the MDFIC 
interference cancellation technique. At a BER of 0.001 the use of the above MUD technique doubles 
system capacity. 
1. Introduction

This report briefly describes long-code Multi-User Detection (MUD). Section 2 describes 
the long-code signal model, which is different from the short-code model. Section 3 
describes the matched-filtering operation for long codes and gives a lower bound on the 
GOPS required for long-code symbol-rate MUD. The lower bound is 19.7 TOPS (i.e. Tera 
Operations Per Second; 1 TOPS = 1000 GOPS). Because of the extreme computational 
complexity of symbol-rate MUD for long codes regenerative MUD is examined. It is shown 
in Section 4 that although regenerative MUD operates at the chip rate, the overall 
complexity is lower for long codes. Two methods are examined. The first method is a 
somewhat straight-forward implementation of regenerative MUD. The required 
computational complexity is shown to be 774.6 GOPS for 100 users. The second method 
is based on combining impulse trains and subsequently raised-cosine filtering the
composite signal. The total computational complexity is shown to be 109.6 GOPS for 100
users. Regenerative MUD is linear in the number of users, so that if the number of users
is reduced to 64 the complexity drops to 70.1 GOPS. The complexity is also linear in the
number of multipaths subtracted, so that if the number of multipaths subtracted is reduced
from 4 to 2 the complexity drops to 35.1 GOPS. It may be desirable for MUD performance
to subtract only the two largest multipaths due to channel amplitude estimation errors. The
above complexity figures are for a single interference cancellation stage. For two stages 
the computation is doubled. To perform regenerative MUD the baseband antenna stream 
data must be brought onto the MUD board. The required bandwidth is 123 MB/s. Note that 
the figures given above can perhaps be reduced through a clever implementation. A block 
diagram of regenerative MUD is shown to facilitate an investigation into the feasibility of 
an FPGA or ASIC implementation. 
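
The scaling rules quoted in this introduction amount to a simple proportionality. The sketch below takes the 109.6 GOPS reference point (100 users, 4 multipaths, one stage) from the text and applies the stated linear scaling; the function name is illustrative.

```python
def regenerative_mud_gops(users, multipaths, stages=1,
                          ref_gops=109.6, ref_users=100, ref_multipaths=4):
    """Scale the reference complexity linearly in users, multipaths and stages."""
    return ref_gops * (users / ref_users) * (multipaths / ref_multipaths) * stages


gops_64_users = regenerative_mud_gops(64, 4)
gops_64_users_2_paths = regenerative_mud_gops(64, 2)
gops_two_stages = regenerative_mud_gops(64, 2, stages=2)
print(gops_64_users, gops_64_users_2_paths, gops_two_stages)
```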



2. Signal Model 
The received signal model for short-code WCDMA is given in [1]. When long codes are
used the signal model is different since effectively the codes change from symbol to
symbol. We present here the WCDMA signal model for long codes. The baseband
received signal can be written

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_{km}[t - mT_k]\, b_k[m] + w[t] \qquad (1)$$

where t is the integer time sample index, T_k = N_k N_c is the data bit duration, which depends
on the user spreading factor, N_k is the spreading factor for the kth virtual user, N_c is the
number of samples per chip, K is the total number of physical users, w[t] is receiver noise,
and where s̃_km[t] is the channel-corrupted signature waveform for the kth virtual user over
the mth symbol period. The concept of virtual users is used to account for both the
DPDCH and the DPCCH. Hence if there are K physical users, then there are K_v = 2K
virtual users. The user signature waveform and hence the channel-corrupted signature
waveform vary from symbol period to symbol period since long codes by definition extend
over many symbol periods. For L multipath components the channel-corrupted signature
waveform for virtual user k is modeled as

$$\tilde{s}_{km}[t] = \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp}] \qquad (2)$$

where a_kp are the complex multipath amplitudes. The amplitude ratios β_k are incorporated
into the amplitudes a_kp. Notice that if k and l are virtual users corresponding to the
DPCCH and the DPDCH of the same physical user then, aside from scaling by β_k
and β_l, a_kp and a_lp are equal. This is due to the fact that the signal waveforms for both the
DPCCH and the DPDCH pass through the same channel.

The waveform s_km[t] is referred to as the signature waveform for the kth virtual user over
the mth symbol period. This waveform is generated by passing the spreading code
sequence c_km[n] through a pulse-shaping filter g[t]

$$s_{km}[t] = \sum_{r=0}^{N_k - 1} g[t - rN_c]\, c_{km}[r] \qquad (3)$$

where g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine pulse as
opposed to a root-raised-cosine pulse, the received signal r[t] represents the baseband
signal after filtering by the matched chip filter.



3. Matched filter 

The received signal above, which has been match-filtered to the chip pulse, must next be
match-filtered by the user code-sequence filter. The resulting detection statistic is
denoted here as $y_k[m]$, the matched-filter output for the kth virtual user over the mth
symbol period. Since there are $K_v$ codes, there are $K_v$ such detection statistics, which are
collected into a column vector $y[m]$. The matched-filter output $y_l[m]$ for the lth virtual user
can be written

$$y_l[m] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} r[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\} \qquad (4)$$



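where $\hat{a}_{lq}$ is the estimate of $a_{lq}$, $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, and $\eta_l[m]$ is the match-filtered
receiver noise. Substituting $r[t]$ from Equation (1) above gives

$$\begin{aligned}
y_l[m] &= \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} \left[ \sum_{k=1}^{2K} \sum_{m'} \sum_{p=1}^{L} a_{kp}\, s_{km'}[nN_c + \hat{\tau}_{lq} + mT_l - \tau_{kp} - m'T_k]\, b_k[m'] + w[nN_c + \hat{\tau}_{lq} + mT_l] \right] c_{lm}^{*}[n] \right\} \\
&= \sum_{m'} \sum_{k=1}^{2K} \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{*}\, a_{kp} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} s_{km'}[nN_c + \hat{\tau}_{lq} + mT_l - \tau_{kp} - m'T_k]\, c_{lm}^{*}[n] \right\} b_k[m'] + \eta_l[m] \\
&= \sum_{m'} \sum_{k=1}^{2K} \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{*}\, a_{kp}\, C_{lkqp}[m,m'] \right\} b_k[m'] + \eta_l[m]
\end{aligned}$$

$$C_{lkqp}[m,m'] \equiv \frac{1}{2N_l} \sum_{n=0}^{N_l-1} s_{km'}[nN_c + t_{lkqp}[m,m']]\, c_{lm}^{*}[n], \qquad t_{lkqp}[m,m'] \equiv \hat{\tau}_{lq} + mT_l - \tau_{kp} - m'T_k \qquad (5)$$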



In order to subtract interference we must, at a minimum, calculate $C_{lkqp}[m,m']$ for all virtual
users and for all multipath components. A lower bound on the computational complexity
can be determined by considering the above calculations for synchronous users. For
synchronous users, all at the highest spreading factor, the required number of operations
to calculate $C_{lkqp}[m,m']$ is $8(256)(2KL)^2 = 1.31$ Gops for K = 100 and L = 4. For real-time
operation 15000 such computations must be performed every second. This amounts to
19.7 TOPS (i.e. Tera Operations Per Second).
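The arithmetic behind this lower bound can be checked with a few lines of Python. This is only a sanity check of the figures quoted above; the variable names are illustrative and do not come from the original text.

```python
# Lower bound on symbol-rate MUD cost for synchronous users at SF 256.
K, L = 100, 4                  # physical users, multipath components
SF = 256                       # highest spreading factor
OPS_PER_CMAC = 8               # operations per complex multiply-accumulate
SYMBOLS_PER_SECOND = 15_000    # symbol rate used in the text

virtual_paths = 2 * K * L      # 2K virtual users, L paths each
ops_per_symbol = OPS_PER_CMAC * SF * virtual_paths ** 2
ops_per_second = ops_per_symbol * SYMBOLS_PER_SECOND

print(f"{ops_per_symbol / 1e9:.2f} Gops per symbol period")
print(f"{ops_per_second / 1e12:.1f} TOPS")
```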



4. Regenerative MUD
Because of the extreme computational complexity of symbol-rate MUD for long codes it is
advantageous to resort to regenerative MUD when long codes are used. Although
regenerative MUD operates at the chip rate, the overall complexity is lower for long
codes. For regenerative MUD the signal waveforms of interferers are regenerated at the
sample rate and effectively subtracted from the received signal. A second pass through
the matched filter then yields improved performance. It turns out that the computational
complexity of regenerative MUD is linear in the number of users.

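The received signal can be written

$$r[t] = \sum_{k=1}^{2K} r_k[t] + w[t] \qquad (6)$$

$$r_k[t] \equiv \sum_{m} \sum_{p=1}^{L} a_{kp}\, s_{km}[t - \tau_{kp} - mT_k]\, b_k[m]$$

Subtracting interference gives a cleaned-up signal $x_l[t]$

$$\begin{aligned}
x_l[t] &= r[t] - \sum_{k \neq l}^{2K} \hat{r}_k[t] \\
&= r[t] - \hat{r}[t] + \hat{r}_l[t], \qquad \hat{r}[t] \equiv \sum_{k=1}^{2K} \hat{r}_k[t] \\
&= \hat{r}_l[t] + r_{res}[t], \qquad r_{res}[t] \equiv r[t] - \hat{r}[t]
\end{aligned}$$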

Two methods are presented below for performing regenerative MUD. 
First Method 

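In order to subtract interference we must reconstruct (regenerate) the waveform $s_{km}[t]$ as
given in Equation (3). The waveform can be reconstructed using

$$\begin{aligned}
s_{km}[t] &= \sum_{p=0}^{N_k/4-1} \sum_{j=0}^{3} g[t - (4p + j)N_c]\, c_{km}[4p + j] \\
&= \sum_{p=0}^{N_k/4-1} s_{kmp}[t - 4pN_c], \qquad s_{kmp}[t] \equiv \sum_{j=0}^{3} g[t - jN_c]\, c_{kmp}[j], \quad c_{kmp}[j] \equiv c_{km}[4p + j]
\end{aligned} \qquad (8)$$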



The idea is that $s_{km}[t]$ can be represented as a summation of shifted waveforms $s_{kmp}[t]$,
which are entirely specified by the 8 binary numbers comprising the complex sequence
$c_{kmp}[j]$ of length 4. Hence there are only $2^8 = 256$ such waveforms. For what follows we
assume that the signals are sampled at $N_c = 8$ samples per chip. Each is of length 96 +
3(4) = 108 samples assuming that g[t] is of length 96. For 2 bytes per sample (real and
imaginary parts) the total memory requirement is 216*256 = 55296 bytes, which spills out
of L1 cache, but fits entirely in L2 cache.
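The table-lookup idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the text: the pulse `g` is a placeholder triangular shape rather than a true raised-cosine pulse, the helper names (`shape`, `build_table`, `regenerate`) are hypothetical, and each table entry here has length `len(g) + 3*NC`. The sketch builds the 256 four-chip waveforms once and then confirms that summing shifted table entries reproduces direct pulse shaping of the full chip sequence.

```python
import random
from itertools import product

NC = 8    # samples per chip (value used in the text)
SEG = 4   # chips per table segment

def shape(chips, g, nc=NC):
    """Direct pulse shaping: s[t] = sum_r g[t - r*nc] * chips[r]."""
    out = [0j] * (len(g) + (len(chips) - 1) * nc)
    for r, c in enumerate(chips):
        for i, gi in enumerate(g):
            out[r * nc + i] += gi * c
    return out

def build_table(g):
    """One stored waveform per 4-chip segment with chips in {+-1 +-j}."""
    alphabet = [complex(a, b) for a in (1, -1) for b in (1, -1)]
    return {seg: shape(seg, g) for seg in product(alphabet, repeat=SEG)}

def regenerate(chips, table, g, nc=NC):
    """Sum table waveforms, each shifted by 4*p*nc samples (p = segment index)."""
    out = [0j] * (len(g) + (len(chips) - 1) * nc)
    for p in range(len(chips) // SEG):
        seg = tuple(chips[SEG * p: SEG * (p + 1)])
        for i, v in enumerate(table[seg]):
            out[SEG * p * nc + i] += v
    return out

# Hypothetical test data: placeholder pulse of length 96 and 16 random chips.
rng = random.Random(0)
g = [1.0 - abs(i - 47.5) / 48.0 for i in range(96)]
chips = [complex(rng.choice((1, -1)), rng.choice((1, -1))) for _ in range(16)]

table = build_table(g)
direct = shape(chips, g)
rebuilt = regenerate(chips, table, g)
max_err = max(abs(x - y) for x, y in zip(direct, rebuilt))
print(len(table), max_err)
```

The lookup replaces four pulse-shaping passes per segment with one table read, which is the trade of memory for computation that the paragraph above describes.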



To generate $\hat{r}_k[t]$ for a single symbol period, 64 of these waveforms must be read from
memory. For each of these 64 waveforms L complex macs are required per sample per
symbol period. Hence 64(8L)(108) operations are required per symbol period. For L = 4
this amounts to 64(32)(108) = 221184 operations per symbol period (1/15 ms), or 3.32
GOPS. The formation of $\hat{r}[t]$ then requires 2K times this, or 3.32(200) = 664 GOPS for K
= 100 physical users. To form $\hat{r}_l[t] + r_{res}[t]$ requires an additional 2(96+255*4) = 2232
operations per symbol period per virtual user, or another 6.7 GOPS. Finally, the matched
filter operation needs to be performed for each user, which from Equation (4) requires
NLK complex macs (N = 256), or 256(4)(100)(8)*15000 = 12.3 GOPS. The GOPS figures
above are for a single antenna. For two antennas the operations are doubled. Hence the
total computational complexity is 2(664 + 6.7 + 12.3) = 1.37 TOPS. This is for a single-stage
MPIC algorithm. For two stages the computation is doubled.
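The operation count for the first method can be reproduced with the short Python check below. It simply re-evaluates the products quoted in the paragraph above; the variable names are illustrative.

```python
# First method: operation counts per second for K = 100 physical users.
K, L = 100, 4
SYMBOL_RATE = 15_000            # symbol periods per second
virtual_users = 2 * K

regen_one = 64 * (8 * L) * 108 * SYMBOL_RATE                  # one virtual user
regen_all = regen_one * virtual_users                         # all virtual users
add_back = 2 * (96 + 255 * 4) * SYMBOL_RATE * virtual_users   # form r_l + r_res
matched = 256 * L * K * 8 * SYMBOL_RATE                       # rake matched filter
total_two_antennas = 2 * (regen_all + add_back + matched)

for name, ops in [("regenerate one user", regen_one),
                  ("regenerate all users", regen_all),
                  ("add back own signal", add_back),
                  ("matched filter", matched),
                  ("total, two antennas", total_two_antennas)]:
    print(f"{name:22s} {ops / 1e9:8.2f} GOPS")
```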

To perform regenerative MUD the baseband antenna stream data must be brought onto 
the MUD board. The required bandwidth is 

[2 Bytes(complex)/Sa/Ant][2 Ant][8 Sa/chip][3.84 Mchips/ second] = 123 MB/s 



Second Method 

The second method is to represent the waveform for each multipath for each user as a 
complex impulse train with N c = 8 samples per impulse. The complex amplitude of each 
impulse is the product of the complex chip, complex multipath amplitude and the binary 
(real) data bit estimate. These 2KL complex streams (times 2 for 2 antennas) are added to 
form a composite signal. Since this composite signal is a sum of many impulse trains, all
asynchronous, the composite signal is a dense (i.e. no systematic zeros) signal at the
sample rate. A block diagram of the processing is shown in Figure 1.

Figure 1. A block diagram of the long-code MUD processing
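From Equations (7) and (8)

$$\begin{aligned}
\hat{r}[t] &= \sum_{k=1}^{2K} \sum_{m} \sum_{p=1}^{L} \hat{a}_{kp}\, s_{km}[t - \hat{\tau}_{kp} - mT_k]\, \hat{b}_k[m] \\
&= \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{m} \sum_{r=0}^{N_k-1} g[t - \hat{\tau}_{kp} - mT_k - rN_c]\, c_{km}[r]\, \hat{b}_k[m] \\
&= \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{r} \sum_{n} g[r]\, \delta[t - r - \hat{\tau}_{kp} - nN_c]\, c_k[n]\, \hat{b}_k[\lfloor n/N_k \rfloor] \\
&= \sum_{r} g[r] \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \hat{a}_{kp}\, \delta[t - r - \hat{\tau}_{kp} - nN_c]\, c_k[n]\, \hat{b}_k[\lfloor n/N_k \rfloor] \\
&= \sum_{r} g[r]\, \alpha[t - r]
\end{aligned}$$

$$\alpha[t] \equiv \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \hat{a}_{kp}\, \delta[t - \hat{\tau}_{kp} - nN_c]\, c_k[n]\, \hat{b}_k[\lfloor n/N_k \rfloor] \qquad (9)$$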



where $\alpha[t]$ is the composite signal. For each symbol period this requires 256(10)(2KL)
operations per antenna. For two antennas this amounts to 5120(200)(4) = 4096000
operations per symbol period, or 61.4 GOPS.
The estimate of the received signal is then determined by passing the composite signal
through the raised-cosine filter g[t] of length 96, which requires 96 real macs, or 192 real
operations, per sample per real stream. There are a total of 4 real streams (2 antennas,
real and imaginary streams). The total GOPS then for $N_c = 8$ samples per chip is
192(4)(8)(3.84M) = 23.6 GOPS.
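The equivalence that the second method relies on (filtering one composite impulse train gives the same result as pulse-shaping every path separately and adding) can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the pulse `g` is a placeholder triangular shape, the user dictionaries and helper names are hypothetical, and the amplitudes, lags and spreading factors are arbitrary small test values.

```python
import random

NC = 8  # samples per chip

def convolve(x, g):
    """Plain FIR filtering of the composite signal with the pulse g."""
    out = [0j] * (len(x) + len(g) - 1)
    for i, xi in enumerate(x):
        if xi:
            for j, gj in enumerate(g):
                out[i + j] += xi * gj
    return out

def composite_impulse_train(users, n_samples):
    """alpha[t]: each impulse is amplitude * chip * data-bit estimate."""
    alpha = [0j] * n_samples
    for u in users:
        for amp, lag in u["paths"]:
            for n, chip in enumerate(u["chips"]):
                bit = u["bits"][n // u["sf"]]
                alpha[lag + n * NC] += amp * chip * bit
    return alpha

def regenerate_directly(users, g, n_samples):
    """Reference: pulse-shape every path of every user separately, then add."""
    out = [0j] * (n_samples + len(g) - 1)
    for u in users:
        for amp, lag in u["paths"]:
            for n, chip in enumerate(u["chips"]):
                bit = u["bits"][n // u["sf"]]
                for j, gj in enumerate(g):
                    out[lag + n * NC + j] += amp * chip * bit * gj
    return out

rng = random.Random(1)
g = [1.0 - abs(i - 47.5) / 48.0 for i in range(96)]   # placeholder pulse

def make_user(sf, n_bits, paths):
    chips = [complex(rng.choice((1, -1)), rng.choice((1, -1)))
             for _ in range(sf * n_bits)]
    bits = [rng.choice((1, -1)) for _ in range(n_bits)]
    return {"sf": sf, "chips": chips, "bits": bits, "paths": paths}

users = [make_user(4, 4, [(0.9 + 0.2j, 3), (0.3 - 0.1j, 17)]),
         make_user(8, 2, [(0.5j, 0), (-0.4 + 0.4j, 29)])]
n_samples = 29 + 16 * NC

alpha = composite_impulse_train(users, n_samples)
via_filter = convolve(alpha, g)
direct = regenerate_directly(users, g, n_samples)
max_err = max(abs(x - y) for x, y in zip(via_filter, direct))
print(max_err)
```

The filter cost in this form is paid once per sample of the composite signal, independent of the number of users, which is why the second method scales more gently than per-user regeneration.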

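The final step is to pass the cleaned-up signal $x_l[t] = \hat{r}_l[t] + r_{res}[t]$ through the matched filter
(i.e. rake receiver) which gives the improved detection statistic

$$\begin{aligned}
y_l'[m] &= \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} x_l[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\} \\
&= \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} \hat{r}_l[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\}
 + \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l-1} r_{res}[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\} \\
&= A_l^2 \cdot \hat{b}_l[m] + y_{res,l}[m]
\end{aligned} \qquad (10)$$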



The matched filter operation requires NLK complex macs, or 256(4)(100)(8)*15000 = 12.3
GOPS. The GOPS figures above are for a single antenna. For two antennas the
operations are doubled, giving 24.6 GOPS. The total computational complexity for the
second method is then 61.4 + 23.6 + 24.6 = 109.6 GOPS.
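As a cross-check, the second-method total can be recomputed from its three components. The snippet below only re-evaluates the products quoted in the text (variable names are illustrative); the result is roughly a factor of twelve below the 1.37 TOPS of the first method.

```python
# Second method: operation counts per second, two antennas, K = 100.
K, L = 100, 4
NC = 8                          # samples per chip
SYMBOL_RATE = 15_000            # symbol periods per second
CHIP_RATE = 3.84e6              # chips per second

impulse_trains = 2 * 256 * 10 * (2 * K * L) * SYMBOL_RATE   # composite signal
pulse_filter = 192 * 4 * NC * CHIP_RATE                     # 4 real streams
matched_filter = 2 * 256 * L * K * 8 * SYMBOL_RATE          # rake, two antennas
total = impulse_trains + pulse_filter + matched_filter

print(f"impulse trains : {impulse_trains / 1e9:6.1f} GOPS")
print(f"pulse filter   : {pulse_filter / 1e9:6.1f} GOPS")
print(f"matched filter : {matched_filter / 1e9:6.1f} GOPS")
print(f"total          : {total / 1e9:6.1f} GOPS")
```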
1. Introduction

This report investigates a number of different methods for calculating the R- 
matrix elements. There are two parts to the calculation. First is the calculation of 
the user code correlations at lag offsets determined by the searcher receivers. 
This calculation must be performed every time a multipath component changes 
to a new lag. The assumption used here is that every 100 ms one multipath 
component changes to a new lag for each user. Hence, if each user has 4
multipath lags, then all R-matrix elements will have changed after 400 ms. The
validity of this assumption will have to be tested with measured data. Note that
the WCDMA standard calls out a test with 2 multipath components, where one lag
changes every 191 ms [1]. The second part is the actual calculation of the R-matrix
elements, which requires a double summation of code correlations over all
multipath components, with each term scaled by the Rayleigh-fading multipath
amplitudes. The maximum time period to perform this calculation is about 1.33
ms. Hence there are two parts to the calculation, each with a different update
rate.

Section 2 is devoted to the first part of the calculation, the code correlations.
Section 3 covers the actual calculation of the R-matrix elements.
2. Calculation of User Code Correlations

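The R-matrix elements can be expressed as [2]

$$\begin{aligned}
\rho_{lk}[m']\, A_l A_k &= \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*}\, \hat{a}_{kq'}\, C_{lkqq'}[m'] \right\} \\
C_{lkqq'}[m'] &\equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}]\, c_l^{*}[n]
\end{aligned} \qquad (1)$$

where $C_{lkqq'}[m']$ is a five-dimensional matrix of code correlations. Both l and k
range from 1 to $K_v$, where $K_v$ is the number of virtual users. If there are K physical
users, all operating at the highest spreading factor, then there are $K_v = 2K$ virtual
users. For now consider K = 128 so that $K_v = 256$. The indices q and q' range
from 1 to L, the number of multipath components, which for this report is
assumed to be equal to 4. The symbol period offset m' ranges from -1 to 1. The
total number of matrix elements to be calculated is then
$N_C = 3(K_v L)^2 = 3(1024)^2 = 3\mathrm{M}$ complex elements, or 24 MB if each element is a
float. This number is reduced, however, due to the symmetries

$$\begin{aligned}
C_{klq'q}[-m'] &= \frac{1}{2N_k} \sum_{n} \sum_{p} g[(n - p)N_c - m'T - \hat{\tau}_{lq} + \hat{\tau}_{kq'}]\, c_l[p]\, c_k^{*}[n] \\
&= \frac{1}{2N_k} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}]\, c_k^{*}[p]\, c_l[n] \\
&= \frac{N_l}{N_k}\, C_{lkqq'}^{*}[m']
\end{aligned} \qquad (2)$$

so that it is sufficient to store elements for offsets m' = 0, 1. The memory
requirement is then 16 MB if each element is a float. If the elements are stored
as bytes the requirement is reduced to 4 MB.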
Referring to Equation 1, line 2, it is evident that each element of $C_{lkqq'}[m']$ is a
complex dot product between a code vector $c_l$ and a waveform vector $s_{kqq'}$. The
length of the code vector is 256. The length of the waveform vector is $L_g + 255N_c$,
where $L_g$ is the length of the raised-cosine pulse vector g[t] and $N_c$ is the number
of samples per chip. The values for these parameters as currently implemented
are $L_g = 48$ and $N_c = 4$. The length of the waveform vector is then 1068, but for
the dot product it is accessed at a stride of $N_c = 4$, which gives effectively a
length of 267. Note that the code and waveform vectors in general do not entirely
overlap. Also note that an increment or decrement in the symbol offset index m'
slides the waveform vector 256 elements to the left or right respectively. Figure 1
shows that the total number of complex macs (cmacs) for all three (m' = -1, 0, 1)
dot products is 267, irrespective of any relative offset.













Figure 1. Overlap of waveform and code vectors. The total
number of complex macs (cmacs) for all three (m' = -1, 0, 1)
dot products is 267, irrespective of any relative offset.

Hence for any given combination of indices lkqq', the three elements $C_{lkqq'}[m']$
corresponding to m' = -1, 0 and 1 require 267 cmacs to calculate all three. Since
there are $(K_v L)^2$ combinations of indices, the calculation of all elements $C_{lkqq'}[m']$
requires $(K_v L)^2(267)$ cmacs. Given the symmetry condition, only half of the
elements need to be calculated. Noting that each cmac requires 8 operations
to perform, the total number of operations required is

$$N_{ops} = \tfrac{1}{2}(K_v L)^2(267)(8) = \tfrac{1}{2}(1024)^2(267)(8) = 1.12\ \text{Gops} \qquad (3)$$



The total number of GOPS (Giga Operations Per Second), then, given the 400 
ms update rate is 



$$N_{GOPS} = \frac{\tfrac{1}{2}(K_v L)^2(267)(8)\ \text{ops}}{400\ \text{ms}} = \frac{\tfrac{1}{2}(1024)^2(267)(8)\ \text{ops}}{400\ \text{ms}} = 2.80\ \text{GOPS} \qquad (4)$$

The next section addresses the calculation of the R-matrix elements.
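The figures in this section follow from a handful of products, which the Python check below re-evaluates. The names are illustrative; the parameter values are the ones stated in the text ($K_v = 256$, $L = 4$, $L_g = 48$, $N_c = 4$, 400 ms update period).

```python
# Code-correlation cost for Kv = 256 virtual users and L = 4 paths.
KV, L = 256, 4
LG, NC = 48, 4                 # pulse length and samples per chip
OPS_PER_CMAC = 8
UPDATE_PERIOD_S = 0.400

waveform_len = LG + 255 * NC               # full waveform vector length
strided_len = waveform_len // NC           # length seen by the dot product
n_ops = 0.5 * (KV * L) ** 2 * strided_len * OPS_PER_CMAC
gops = n_ops / UPDATE_PERIOD_S / 1e9

print(waveform_len, strided_len)
print(f"{n_ops / 1e9:.2f} Gops per full update, {gops:.2f} GOPS")
```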
3. Calculation of R-matrix Elements

Consider the calculation of the R-matrix elements

$$\rho_{lk}[m']\, A_l A_k = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*}\, \hat{a}_{kq'}\, C_{lkqq'}[m'] \right\} \qquad (5)$$

The total number of matrix elements to be calculated is $N_\rho = 3K_v^2$. This number
is reduced, however, due to the symmetries

$$\rho_{lk}[m'] = \frac{N_k}{N_l}\, \rho_{kl}[-m'] \qquad (6)$$

so that the total number of matrix elements to be calculated is $N_\rho = \tfrac{3}{2}K_v^2$.

Now let us consider the operations per element. Dropping explicit reference to
the symbol period offset [m'], the matrix elements are

$$\rho_{lk}\, A_l A_k = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*}\, \hat{a}_{kq'}\, C_{lkqq'} \right\} \qquad (7)$$

A brute-force calculation requires $L^2(6 + 3 + 1)$ operations (1 complex multiply,
one half-complex multiply, i.e. the real part only, and one real add, or 6 real
multiplies and 4 real adds). The total number of operations is then

$$N_{ops} = \tfrac{3}{2}(K_v L)^2(10) \qquad (8)$$

For a vehicular speed of 120 km/h the Doppler frequency is 216.67 Hz for a user
at frequency 1950 MHz. The Doppler spread is thus 433.33 Hz, and the
corresponding coherence time is about 2.3 ms. Hence the multipath amplitudes
are changing with a time constant of about 2 ms, and consequently the second
part of the calculation must be updated at least every 2 ms. The channel
amplitudes are calculated on a time slot by time slot basis. Each time slot is
10/15 = 2/3 = 0.67 ms. Hence 2 ms equals 3 time slots, whereas two slots equal
1.33 ms. Figures 2 and 3 below show the MUD efficiency versus user velocity for
2 ms and 1.33 ms update times respectively. The plots show that to be able to
effectively handle high velocity users the update time should be 1.33 ms. When
users are at various speeds the interference from low speed users is cancelled
more effectively than the interference from high speed users. The MUD efficiency
will then be an average of the MUD efficiency corresponding to each user's
speed.
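The Doppler and coherence-time figures can be reproduced as follows. This is a rough sketch that assumes the usual relation $f_d = v f_c / c$ with $c = 3 \times 10^8$ m/s and takes the coherence time as the reciprocal of the two-sided Doppler spread $2 f_d$; the function and variable names are illustrative.

```python
C_LIGHT = 3.0e8   # speed of light in m/s (rounded value, assumed here)

def doppler_hz(speed_kmph, carrier_hz):
    """Maximum Doppler shift for a user moving at speed_kmph."""
    return (speed_kmph / 3.6) * carrier_hz / C_LIGHT

fd = doppler_hz(120, 1950e6)          # maximum Doppler shift in Hz
doppler_spread = 2 * fd               # two-sided spread in Hz
coherence_ms = 1e3 / doppler_spread   # approximate coherence time in ms
slot_ms = 10 / 15                     # one time slot of a 10 ms frame

print(f"Doppler {fd:.2f} Hz, spread {doppler_spread:.2f} Hz")
print(f"coherence time {coherence_ms:.2f} ms = {coherence_ms / slot_ms:.1f} slots")
```

With these numbers the coherence time spans a little over three slots, so a two-slot (1.33 ms) update stays inside it while a three-slot (2 ms) update sits close to the limit.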

[Plot: MUD efficiency versus user velocity (kmph), 0 to 120 kmph; calculations updated every 2 ms (3 time slots)]

Figure 2. MUD efficiency versus user velocity for a
2 ms R-matrix update time.



[Plot: MUD efficiency versus user velocity (kmph), 0 to 120 kmph; calculations updated every 1.33 ms (2 time slots)]

Figure 3. MUD efficiency versus user velocity for a
1.33 ms R-matrix update time.



The calculations below are based on a 1.33 ms update time. Note that most of
the capacity and coverage benefits calculated for MUD so far have assumed
70% MUD efficiency. The 1.33 ms update time is sufficient to achieve 70% MUD
efficiency. The total GOPS are then

$$N_{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2(10)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2(10)}{1.33\ \text{ms}} = 11.8\ \text{GOPS} \qquad (9)$$



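where we have assumed L = 4 multipath components. A better way to perform
this operation is

$$\rho_{lk}\, A_l A_k = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \left[ \sum_{q'=1}^{L} C_{lkqq'}\, \hat{a}_{kq'} \right] \right\} \qquad (10)$$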



The inner sum is a matrix-vector multiply, hence requiring $L^2$ cmacs, and the
outer sum is the real part of a complex dot product, which requires L half-cmacs.
The total is then $(L^2 + L/2) = 1.125L^2$ cmacs (for L = 4) times 8 operations per
cmac, or $9L^2$ operations, which gives

$$N_{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2(9)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2(9)}{1.33\ \text{ms}} = 10.6\ \text{GOPS} \qquad (11)$$



The above calculations are represented in terms of complex numbers, which are 
not directly calculable. To express the above equations explicitly in terms of real 
numbers it is convenient to cast the calculations into matrix form 



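$$\sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*}\, C_{lkqq'}\, \hat{a}_{kq'} \right\} = \mathrm{Re}\left\{ a_l^{H} \cdot C_{lk} \cdot a_k \right\}$$

$$a_l \equiv \begin{bmatrix} \hat{a}_{l1} \\ \hat{a}_{l2} \\ \vdots \\ \hat{a}_{lL} \end{bmatrix}, \qquad
C_{lk} \equiv \begin{bmatrix} C_{lk11} & C_{lk12} & \cdots & C_{lk1L} \\ C_{lk21} & C_{lk22} & \cdots & C_{lk2L} \\ \vdots & \vdots & \ddots & \vdots \\ C_{lkL1} & C_{lkL2} & \cdots & C_{lkLL} \end{bmatrix}, \qquad
a_k \equiv \begin{bmatrix} \hat{a}_{k1} \\ \hat{a}_{k2} \\ \vdots \\ \hat{a}_{kL} \end{bmatrix} \qquad (12)$$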
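The quadratic form $a_l^{H} C_{lk}\, a_k$ can be expressed, writing $a \equiv a_l = a_r + ja_i$,
$C \equiv C_{lk} = C_r + jC_i$ and $b \equiv a_k = b_r + jb_i$, as

$$\begin{aligned}
\mathrm{Re}\left\{ a^{H} \cdot C \cdot b \right\} &= \mathrm{Re}\left\{ [a_r^T - ja_i^T]\, [C_r + jC_i]\, [b_r + jb_i] \right\} \\
&= \mathrm{Re}\left\{ [a_r^T - ja_i^T]\, [C_r b_r - C_i b_i + j(C_i b_r + C_r b_i)] \right\} \\
&= \mathrm{Re}\left\{ (a_r^T C_r b_r - a_r^T C_i b_i + a_i^T C_i b_r + a_i^T C_r b_i) + j(a_r^T C_r b_i + a_r^T C_i b_r - a_i^T C_r b_r + a_i^T C_i b_i) \right\} \\
&= a_r^T C_r b_r - a_r^T C_i b_i + a_i^T C_i b_r + a_i^T C_r b_i
\end{aligned} \qquad (13)$$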



The matrix-vector multiplication requires $(2L)^2$ macs. The dot product adds (2L)
macs so that the total is $(2L)^2 + (2L)$ macs. For L = 4 we have $1.125(2L)^2$ macs =
$4.5L^2$ macs = $9L^2$ operations. The total GOPS are then

$$N_{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2(9)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2(9)}{1.33\ \text{ms}} = 10.6\ \text{GOPS} \qquad (14)$$
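The real-arithmetic form can be sketched in Python as a real $2L \times 2L$ block matrix applied to stacked real and imaginary parts. The block layout used here, $[[C_r, -C_i], [C_i, C_r]]$, is one way to arrange the four terms of Equation (13) and is an assumption of this sketch, as are the function names and the random test data.

```python
import random

def quad_complex(a, C, b):
    """Reference: Re{a^H C b} evaluated with complex arithmetic."""
    n = len(a)
    return sum((a[q].conjugate() * C[q][p] * b[p]).real
               for q in range(n) for p in range(n))

def quad_real(a, C, b):
    """Same quantity using only real multiply-accumulates."""
    n = len(a)
    a2 = [z.real for z in a] + [z.imag for z in a]    # [a_r; a_i]
    b2 = [z.real for z in b] + [z.imag for z in b]    # [b_r; b_i]
    M = [[0.0] * (2 * n) for _ in range(2 * n)]       # [[C_r, -C_i], [C_i, C_r]]
    for q in range(n):
        for p in range(n):
            M[q][p] = C[q][p].real
            M[q][p + n] = -C[q][p].imag
            M[q + n][p] = C[q][p].imag
            M[q + n][p + n] = C[q][p].real
    v = [sum(M[r][c] * b2[c] for c in range(2 * n)) for r in range(2 * n)]  # (2L)^2 macs
    return sum(a2[r] * v[r] for r in range(2 * n))                          # 2L macs

rng = random.Random(7)
L = 4
rand_c = lambda: complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
a = [rand_c() for _ in range(L)]
b = [rand_c() for _ in range(L)]
C = [[rand_c() for _ in range(L)] for _ in range(L)]

err = abs(quad_complex(a, C, b) - quad_real(a, C, b))
macs = (2 * L) ** 2 + 2 * L
print(err, macs)
```

For L = 4 the mac count printed is 72, which equals $4.5L^2$ and therefore $9L^2 = 144$ operations at two operations per mac.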



Now consider a different formulation which attempts to reuse the amplitude- 
amplitude multiplications. Consider the calculation a T C b 



$$a^T C\, b = \mathrm{tr}[a^T C\, b] = \mathrm{tr}[C\,(b\, a^T)] = \mathrm{tr}[C X], \qquad X \equiv b\, a^T \qquad (15)$$



The calculations to produce matrix X are pure multiplications, but the elements,
once calculated, can be reused for the other virtual users corresponding to the
same physical users. For voice-only users there are 2 virtual users per physical
user. For data users there can be up to 65 virtual users per physical user. For
now, however, we stay with our 128 voice-user scenario. To calculate X, then,
requires $(2L)^2 = 4L^2$ multiplications. This calculation is performed once per pair of
physical users, so the total number of operations is

$$N_{ops} = (KL)^2(4) = (K_v L)^2(1) = \tfrac{3}{2}(K_v L)^2\left(\tfrac{2}{3}\right) \qquad (16)$$

Effectively, then, X requires $(2/3)L^2$ operations. The details to calculate $a^T C\, b$
are

$$a^T C\, b = \mathrm{tr}[C X] = \sum_{i=1}^{2L} c_i\, x_i \qquad (17)$$

where $c_i$ is the ith row of C and $x_i$ is the ith column of X. Hence we have 2L dot
products of length 2L, which require $(2L)^2$ macs = $8L^2$ operations. To calculate
$a^H C\, b$ then requires $8L^2 + (2/3)L^2 = 8.67L^2$ operations, which gives

$$N_{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2(8.67)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2(8.67)}{1.33\ \text{ms}} = 10.3\ \text{GOPS} \qquad (18)$$



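A better way to perform this calculation is as follows

$$\begin{aligned}
\rho &= \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_q^{*}\, a_{q'}\, C_{qq'} \right\} = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ X_{qq'}\, C_{qq'} \right\}, \qquad X_{qq'} \equiv a_q^{*}\, a_{q'} \\
&= \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ (X_{qq'}^{R} + jX_{qq'}^{I})(C_{qq'}^{R} + jC_{qq'}^{I}) \right\} \\
&= \sum_{q=1}^{L} \sum_{q'=1}^{L} \left( X_{qq'}^{R}\, C_{qq'}^{R} - X_{qq'}^{I}\, C_{qq'}^{I} \right)
\end{aligned} \qquad (19)$$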



where for convenience we have dropped $A_k$, the lk subscripts and the hat
symbols. The calculation of X requires

$$N_{ops} = (KL)^2(6) = (K_v L)^2(6/4) = \tfrac{3}{2}(K_v L)^2(1) \qquad (20)$$

operations. Note that, once the X values are calculated, the remainder of the
calculation is a long dot product of length $2L^2$, hence requiring $2L^2$ macs, or $4L^2$
operations, which gives

$$N_{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2(5)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2(5)}{1.33\ \text{ms}} = 5.9\ \text{GOPS} \qquad (21)$$
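The reuse of the amplitude products can be sketched as follows. This is an illustrative Python sketch under assumptions: the helper names are hypothetical, the data are random, and the two virtual users of a physical user are taken to share the same amplitude vector (the scaling by the amplitude ratios is ignored). `X` is computed once per physical-user pair and then applied to several correlation matrices with real multiply-accumulates only.

```python
import random

def outer_conj(a_l, a_k):
    """X[q][q'] = conj(a_l[q]) * a_k[q'], computed once per physical-user pair."""
    return [[x.conjugate() * y for y in a_k] for x in a_l]

def rho_from_x(X, C):
    """Long real dot product of length 2*L^2: sum(X_R*C_R - X_I*C_I)."""
    return sum(x.real * c.real - x.imag * c.imag
               for x_row, c_row in zip(X, C)
               for x, c in zip(x_row, c_row))

def rho_brute(a_l, a_k, C):
    """Reference: sum over q, q' of Re{conj(a_l[q]) * a_k[q'] * C[q][q']}."""
    return sum((a_l[q].conjugate() * a_k[p] * C[q][p]).real
               for q in range(len(a_l)) for p in range(len(a_k)))

rng = random.Random(3)
L = 4
rand_c = lambda: complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
a_l = [rand_c() for _ in range(L)]
a_k = [rand_c() for _ in range(L)]
# Four correlation matrices, one per virtual-user pair of the two physical users.
C_list = [[[rand_c() for _ in range(L)] for _ in range(L)] for _ in range(4)]

X = outer_conj(a_l, a_k)                     # shared across all four pairs
fast = [rho_from_x(X, C) for C in C_list]
slow = [rho_brute(a_l, a_k, C) for C in C_list]
max_err = max(abs(f - s) for f, s in zip(fast, slow))
print(max_err)
```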



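Dual Diversity Antennas

When dual diversity antennas are employed, the calculation of the R-matrix
elements becomes

$$\rho = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_q^{(1)*}\, a_{q'}^{(1)}\, C_{qq'} \right\} + \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_q^{(2)*}\, a_{q'}^{(2)}\, C_{qq'} \right\} \qquad (22)$$

To calculate X for dual diversity antennas, then, requires

$$N_{ops} = (KL)^2(14) = (K_v L)^2(14/4) = \tfrac{3}{2}(K_v L)^2(7/3) = \tfrac{3}{2}(K_v L)^2(2.33) \qquad (23)$$

operations. The remainder of the calculation is again a long dot product of length
$2L^2$ requiring $4L^2$ operations, which gives

$$N_{GOPS} = \frac{\tfrac{3}{2}(K_v L)^2(6.33)}{1.33\ \text{ms}} = \frac{1.5(256 \cdot 4)^2(6.33)}{1.33\ \text{ms}} = 7.5\ \text{GOPS} \qquad (24)$$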



Reuse of C data 



So far we have not addressed the problem associated with a lack of data reuse,
which renders our calculations I/O limited. The C data can be reused by
introducing extra latency into the calculations. For a given user, a single
multipath component changes on average once every 100 ms, or once every 150
slots. Suppose we collect and save in cache 4 amplitude estimate vectors $a_k[q]$,
where q is the 2 ms update index. The total latency is then 8 ms = 12 time slots.
During this time the probability that a multipath lag changes is (8 ms)/(100 ms) =
0.08. The probability that the matrix $C_{lk}$ changes is then $1 - (1 - 0.08)^2 = 0.15$.
Hence for most matrices $C_{lk}$ we will be able to calculate

$$a_l^{H}[q]\, C_{lk}\, a_k[q] \qquad (25)$$

for 12 time slots q for only one read of $C_{lk}$ from memory. The penalty for this
reuse is the 8 ms of latency incurred.
To: John Oates, John Greene, Alden Fuchs, Frank Lauginiger
Date: 31-AUG-2000
From: Mike Vinskus
Subject: Theoretically optimum load balancing for the R matrix calculations
File Ref: mjv-9.doc

This memo describes the calculation of optimum R matrix partitioning points in 
normalized virtual user space. These partitioning points provide an equal, and hence 
balanced, computation load per processor. The computational model of the R matrix 
calculations does not include any data access overhead or caching effects. It is shown 
that a closed form recursive solution exists that can be solved for an arbitrary number of 
processors. 

Although three R matrices are output from the R matrix calculation function, only half of 
the elements are explicitly calculated. This is due to the symmetry condition that exists 
between R matrices: 

R !k (m) = ZR kJ (-m). 

In essence, only two matrices need to be calculated. The first one is a combination of
R(1) and R(-1). The second is the R(0) matrix. In this case, the essential R(0) matrix
elements have a triangular structure to them. The number of computations performed to
generate the raw data for the R(1)/R(-1) and R(0) matrices are combined and optimized as
a single number. This is due to the reuse of the X matrix outer product values across the
two R matrices. Since the bulk of the computations involve combining the X matrix and 
correlation values, they dominate the processor utilization. These computations are used 
as a cost metric in determining the optimum loading of each processor. 

The optimization problem is formulated as an equal area problem, where the solution 
results in each partition area to be equal. Since the major dimensions of the R matrices 
are in terms of the number of active virtual users, the solution space for this problem is in 
terms of the number of virtual users per processor. By normalizing the solution space by 
the number of virtual users, the solution is applicable for an arbitrary number of virtual 
users. 
Figure 1: Normalized R matrix computation model.

Figure 1 shows the model of the normalized optimization problem. The computations for
the R(1)/R(-1) matrix are represented by the square HJKM, while the computations for
the R(0) matrix are represented by the triangle ABC. From geometry, the area of a
rectangle of length b and height h is

$$A_r = bh.$$

For a triangle with a base width b and height h, the area is calculated by

$$A_t = \tfrac{1}{2} bh.$$

When combined with a common height $a_i$, the formula for the area becomes

$$A_i = a_i + \tfrac{1}{2} a_i^2.$$

The formula for $A_i$ gives the area for the total region below the partition line. For
example, the formula for $A_2$ gives the area within the rectangle HQRM plus the region
within triangle AFG. For the cost function, the difference in successive areas is used.
That is

$$\begin{aligned}
B_i &= A_i - A_{i-1} \\
&= \tfrac{1}{2} a_i^2 + a_i - \tfrac{1}{2} a_{i-1}^2 - a_{i-1}
\end{aligned}$$
For an optimum solution, the $B_i$ must be equal for i = 1, 2, ..., N, where N is the number
of processors performing the calculations. Because the total normalized load is equal to
$A_N$, the load per processor is equal to $A_N / N$.

By combining the two equations for $B_i$, the solution for $a_i$ is found by finding the roots of
the equation:

$$a_i^2 + 2a_i - a_{i-1}^2 - 2a_{i-1} - \frac{3}{N} = 0$$

The solution for $a_i$ is:

$$a_i = -1 \pm \sqrt{1 + a_{i-1}^2 + 2a_{i-1} + \frac{3}{N}}, \quad \text{for } i = 1, 2, \ldots, N.$$

Since the solution space must fall in the range [0, 1], negative roots are not valid
solutions to the problem. On the surface, it appears that the $a_i$ must be solved by first
solving for the case where i = 1. However, by expanding the recursions of the $a_i$ and using
the fact that $a_0$ equals zero, a solution that does not require the previous $a_i$, i = 0, 1, ..., n-1,
exists. The solution is:

$$a_i = -1 + \sqrt{1 + \frac{3i}{N}}$$
' V N 

Table 1 shows the normalized partition values for two, three, and four processors. To 
calculate the actual partitioning values, the number of active virtual users is multiplied by 
the corresponding table entries. Since a fraction of a user cannot be allocated, a ceiling 
operation is performed that biases the number of virtual users per processor towards the 
processors whose loading function is less sensitive to perturbations in the number of 
users. 
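The closed-form partition points are easy to evaluate. The following Python sketch computes them, checks that each partition carries the same normalized area, and applies the ceiling step to a hypothetical count of 256 virtual users; the function names and the user count are illustrative assumptions rather than part of the memo.

```python
import math

def partition_points(n_proc):
    """Normalized partition locations a_i = -1 + sqrt(1 + 3*i/N), i = 1..N."""
    return [-1.0 + math.sqrt(1.0 + 3.0 * i / n_proc) for i in range(1, n_proc + 1)]

def area(a):
    """Normalized load below partition line a: rectangle plus triangle."""
    return a + 0.5 * a * a

def user_boundaries(points, n_users):
    """Scale to actual virtual-user indices, rounding up as described above."""
    return [math.ceil(p * n_users) for p in points]

for n_proc in (2, 3, 4):
    pts = partition_points(n_proc)
    loads = [area(b) - area(a) for a, b in zip([0.0] + pts[:-1], pts)]
    print(n_proc, [round(p, 4) for p in pts], [round(x, 4) for x in loads])

print(user_boundaries(partition_points(4), 256))
```

Each processor's load comes out as 1.5/N, and the final partition point is always 1, which is why it does not appear in the table.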



Table 1: Normalized partition locations for two, three, and four processors.

        Two processors            Three processors        Four processors
a_1     -1 + sqrt(5/2) (0.5811)   -1 + sqrt(2) (0.4142)   -1 + sqrt(7/4) (0.3229)
a_2                               -1 + sqrt(3) (0.7321)   -1 + sqrt(5/2) (0.5811)
a_3                                                       -1 + sqrt(13/4) (0.8028)
To: Jonathan Schonfeld
Date: 23-FEB-2001
From: Nmf
Subject: Degraded mode of operation for the MUD algorithm
File Ref: mjv-018-degraded_mode_desc.doc


Reference [1] showed that the load balancing for the R matrix calculations resulted in a non-uniform
partitioning of the rows of the final R matrices over a number of processors. In summary, the
partition sizes increase as the partition starting user index increases.

When the system is running at full capacity (i.e. the maximum number of users is processed while
still within the bounds of real-time operation) and a computational node has a failure, the impact can
be significant.

This impact can be minimized by allocating the first user partition to the disabled node. Also the
values that would have been calculated by that node are set to zero. This reduces the effects of the
failed node. Also, by changing which user data is set to zero (i.e. which users are assigned to the
failed node) the overall errors due to the lack of non-zero output data for that node are averaged
over all of the users, providing a "soft" degradation.



[1] M. Vinskus. "mjv-009: Theoretically optimum load balancing for the R matrix calculations." 
31-AUG-2000. 

[2] M. Vinskus. "mjv-010: Preliminary degraded MUD operation results." 19-OCT-2000.

[3] J. Oates. "jho-001: MUD Algorithms." 25-APR-2000.
To: Wireless Communications Group
From: J. H. Oates
Subject: Methods for Calculating the C-matrix Elements
Date: November 13, 2000

1. Direct Method

The direct method for calculating the C-matrix elements is 

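$$C_{lkqq'}[m'] \equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}]\, c_l^{*}[n] \qquad (1)$$

Symmetry

$$C_{klq'q}[-m'] = \frac{N_l}{N_k}\, C_{lkqq'}^{*}[m'] \qquad (2)$$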

Due to symmetry there are $1.5(K_v L)^2$ elements to calculate. Assuming all users
are at SF 256, each calculation requires 256 cmacs, or 2048 operations. The
probability that a multipath changes in a 10 ms time period is approximately
10/200 = 0.05 if all users are at 120 kmph. Assuming a mix of user velocities,
let's say the probability is 0.025. Since the C-matrix elements represent the
interaction between two users, the probability that C-matrix elements change in a
10 ms time period is approximately 0.10 if all users are at 120 kmph, or 0.05 for
a mix of user velocities. The GOPS are tabulated in Table 1 below.
The C-matrix elements also need to be updated when the spreading factor
changes. The spreading factor can change due to 

• AMR codec rate changes 

• Multiplexing of DCCH 

• Multiplexing data services 

For lack of a better number, assume that 5% of the users, hence 10% of the 
elements change rate every 10 ms. 



Table 1. GOPS to update C-matrix elements using the direct method

K_v   High velocity users   1.5(K_v L)^2   Gops    Percentage change   GOPS
200   100%                  960,000        1.966   20                  39.3
200   50%                   960,000        1.966   15                  29.5
128   100%                  393,216        0.805   20                  16.1
128   50%                   393,216        0.805   15                  12.1
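The rows of Table 1 follow directly from the element count, the 2048 operations per element, the percentage of elements that change, and the 10 ms period. A short Python sketch (function and variable names are illustrative) reproduces them:

```python
def direct_method_row(kv, pct_change, L=4, sf=256, ops_per_cmac=8, period_s=0.010):
    """Return (elements, Gops for a full update, GOPS at the given change rate)."""
    elements = 1.5 * (kv * L) ** 2
    full_update_gops = elements * sf * ops_per_cmac / 1e9
    rate_gops = full_update_gops * (pct_change / 100.0) / period_s
    return elements, full_update_gops, rate_gops

# (K_v, percentage of elements changing every 10 ms): multipath plus rate changes.
rows = [(200, 20), (200, 15), (128, 20), (128, 15)]
table = [direct_method_row(kv, pct) for kv, pct in rows]
for (kv, pct), (elements, full, rate) in zip(rows, table):
    print(f"{kv:4d} {int(elements):9,d} {full:6.3f} {pct:3d} {rate:6.1f}")
```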
2. FFT Method

The FFT can be used to calculate the correlations for a range of offsets $\tau$ using

(3)

The length of the waveform $s_k[t]$ is $L_g + 255N_c = 1068$ for $L_g = 48$ and $N_c = 4$. This
is represented as $N_c$ waveforms of length $L_g/N_c + 255 = 267$.

One advantage of this approach is that elements can be stored for a range of
offsets $\tau$ so that calculations do not need to be performed when lags change. For
delay spreads of about 4 μs, 32 samples need to be stored for each m'.
3. Using Code Correlations

The C-matrix elements can be represented in terms of the underlying code
correlations using

$$\begin{aligned}
C_{lkqq'}[m'] &\equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \cdot c_l^{*}[n] \\
&= \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}]\, c_k[p]\, c_l^{*}[n] \\
&= \frac{1}{2N_l} \sum_{n} \sum_{m} g[mN_c + \tau]\, c_k[n - m]\, c_l^{*}[n], \qquad \tau \equiv m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'} \\
&= \sum_{m} g[mN_c + \tau] \cdot \frac{1}{2N_l} \sum_{n} c_l^{*}[n]\, c_k[n - m] \\
&= \sum_{m} g[mN_c + \tau] \cdot \Gamma_{lk}[m]
\end{aligned} \qquad (4)$$

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n} c_l^{*}[n] \cdot c_k[n - m]$$

If the length of g[t] is $L_g = 48$ and $N_c = 4$, then the summation over m requires
48/4 = 12 macs for the real part and 12 macs for the imaginary part. The total ops
is then 48 ops per element. (Compare with 2048 operations for the direct
method.) Hence for the case where there are 200 virtual users and 20% of the C-matrix
needs updating every 10 ms the required complexity is (960000 el)(48
ops/el)(0.20)/(0.010 sec) = 921.6 MOPS. This is the required complexity to
compute the C-matrix from the Γ-matrix. The cost of computing the Γ-matrix must
also be considered. There is reason to hope that the Γ-matrix can be efficiently
computed since the fundamental operation is a convolution of codes with
elements constrained to be ±1 ± j.

The Γ-matrix elements can be calculated using

• the FFT 

• Modulo-2 arithmetic 

• Hardware XOR 

• Short-code generator(?) 
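The factorization of a C-matrix element into a pulse term and a code-correlation term can be checked numerically. The Python sketch below is illustrative only: it uses short random codes of 16 chips rather than 256, a placeholder triangular pulse `g_at` in place of the raised-cosine pulse, and hypothetical function names. It compares the direct dot product against the sum over code correlations.

```python
import random

NC = 4  # samples per chip

def g_at(t, half_width=24):
    """Placeholder symmetric pulse with support |t| < half_width samples."""
    return max(0.0, 1.0 - abs(t) / half_width)

def gamma(c_l, c_k, m):
    """Code correlation: (1/2N) * sum_n conj(c_l[n]) * c_k[n - m]."""
    total = 0j
    for n in range(len(c_l)):
        if 0 <= n - m < len(c_k):
            total += c_l[n].conjugate() * c_k[n - m]
    return total / (2 * len(c_l))

def c_direct(c_l, c_k, tau):
    """Direct method: correlate the pulse-shaped waveform of k with code l."""
    total = 0j
    for n in range(len(c_l)):
        s = sum(g_at((n - p) * NC + tau) * c_k[p] for p in range(len(c_k)))
        total += s * c_l[n].conjugate()
    return total / (2 * len(c_l))

def c_from_gamma(c_l, c_k, tau):
    """Code-correlation method: sum_m g[m*NC + tau] * Gamma[m]."""
    total = 0j
    for m in range(-(len(c_k) - 1), len(c_l)):
        weight = g_at(m * NC + tau)
        if weight:                      # only about L_g / NC lags contribute
            total += weight * gamma(c_l, c_k, m)
    return total

rng = random.Random(5)
chip = lambda: complex(rng.choice((1, -1)), rng.choice((1, -1)))
c_l = [chip() for _ in range(16)]
c_k = [chip() for _ in range(16)]

errors = [abs(c_direct(c_l, c_k, tau) - c_from_gamma(c_l, c_k, tau))
          for tau in range(-10, 11)]
print(max(errors))
```

In practice the `gamma` values would be precomputed and stored, so that only the short weighted sum over m remains when a lag changes.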
4. Using Fundamental Correlations

The waveform $s_k[t]$ can be decomposed into fundamental waveforms
corresponding to 4-chip segments of the corresponding complex user codes.
There are $2^8 = 256$ such waveforms. Each of these can be correlated with
another 256 possible 4-chip code segments. For each correlation there are about
64 offsets that produce a non-zero correlation. Hence all correlation calculations
can be represented in terms of 256(256)(64) = 4M fundamental complex
correlations. The C-matrix elements are then

$$C_{lk}[\tau] = \sum_{i=0}^{63} \sum_{j=0}^{63} \frac{1}{2N_l} \sum_{n=0}^{3} s_{ki}[(n + 4(j - i))N_c + \tau] \cdot c_{lj}^{*}[n]$$

Using the above, each C-matrix element requires 64(64) = 4096 complex adds,
or 8192 operations to calculate. (Compare with 2048 operations for the direct
method.)

Alternately, the calculations can be represented in terms of 4-chip real code
segments and the corresponding waveforms. Hence all correlation calculations
can be represented in terms of 16(16)(64) = 16K fundamental real correlations.
Subject: Calculation of C-matrix Elements
Date: August 10, 2000


1. Introduction 

The C-matrix elements are used to calculate the R-matrices, which are used by the MDF
interference cancellation routine. Each C-matrix element can be calculated as a dot
product between the kth user's waveform and the lth user's code stream, each offset by
some multipath delay. For this method of calculation, each time a user's multipath profile
changes all C-matrix elements associated with the changed profile must be recalculated.
It is estimated that a user profile changes every 100 ms. This number, however, is based
on very little data, and there is considerable risk that profiles may change more rapidly
and compromise real-time operation. In addition, there is a large amount of overhead that
must be performed before each dot product. In a recent benchmark the overhead
consumed nearly all of the time allocated for the entire C-matrix update. Finally, if the
C-matrix is calculated as described above then an entire processor must be allocated for
this calculation.

In view of the above observations a better approach is to pre-calculate the code
correlations up-front when a user is added to the system. This calculation is performed
over all possible code offsets and the calculations are stored in a large array,
approximately 21 Mbytes in size. We will henceforth refer to this large matrix as the Γ
matrix. The C-matrix elements are updated when a profile changes by extracting the
appropriate elements from the Γ matrix and performing minor calculations. Since the Γ
matrix elements are calculated for all code offsets the FFT can be effectively used to
speed up the calculations. Since all code offsets are pre-calculated, there is no risk
associated with rapidly changing multipath profiles. Under normal operating conditions,
when the number of users accessing the system is constant, the resources which must be
allocated to extracting the C-matrix elements are minimal, and so extra resources may be
allocated to the R-matrix calculation.

Section 2 below outlines the calculation of the Γ matrix elements. It is shown that the Γ
matrix elements are given in terms of a convolution. Section 3 shows how to calculate the
Γ matrix elements using the FFT. Section 4 describes how the Γ-matrix elements might be
accessed from SDRAM. In section 5 various processing times are estimated, and a
summary with conclusions is given in section 6.

2. C-matrix Elements Expressed in Terms of Code Correlations 

The R-matrix elements are given in terms of the C-matrix elements as [1] 

    R_{lk}[m'] = \sum_{q=1}^{L} \sum_{q'=1}^{L} \{ \cdots \}

    C_{lkqq'}[m'] \equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_l^*[n]        (1)

where C_lkqq'[m'] is a five-dimensional matrix of code correlations. Both l and k range from 1
to K_v, where K_v is the number of virtual users. The indices q and q' range from 1 to L, the
number of multipath components, which is assumed to be equal to 4. The symbol period
offset m' ranges from -1 to 1. The total number of matrix elements to be calculated is then
N_C = 3(K_v L)^2 = 3(800)^2 = 1.92M complex elements, or 3.84 MB if the real and imaginary
parts of each element are stored as one byte each. This number is cut in half, however,
due to the symmetries [2]

    C_{lkqq'}[-m'] = \frac{N_k}{N_l} C_{klq'q}^*[m']        (2)

The memory requirement is then 1.92 MB.

Referring to Equation (1) it is evident that each element of C_lkqq'[m'] is a complex dot
product between a code vector c_l and a waveform vector s_k. The length of the code
vector is 256. The waveform s_k[t] is referred to as the signature waveform for the kth
virtual user. This waveform is generated by passing the spread code sequence c_k[n]
through a pulse-shaping filter g[t]

    s_k[t] = \sum_{p=0}^{N-1} g[t - pN_c] \, c_k[p]        (3)

where N = 256 and g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine
pulse as opposed to a root-raised-cosine pulse, the signature waveform s_k[t] includes the
effects of filtering by the matched chip filter. Note that for spreading factors less than 256
some of the chips c_k[p] are zero. The length of the waveform vector is L_g + 255N_c, where
L_g is the length of the raised-cosine pulse vector g[t] and N_c is the number of samples per
chip. The values for these parameters as currently implemented are L_g = 48 and N_c = 4.
The length of the waveform vector is then 1068, but for the dot product it is accessed at a
stride of N_c = 4, which gives effectively a length of 267.
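The waveform construction of Equation (3) and the two lengths quoted above can be checked with a short sketch. It is illustrative only: the roll-off factor `beta = 0.22` and the toy chip pattern are assumptions that do not come from the text, and the function names are hypothetical.

```python
import math

N, N_C, L_G = 256, 4, 48   # chips per symbol, samples per chip, pulse length (from the text)

def raised_cosine(n, beta=0.22):
    """Raised-cosine pulse at sample n (n / N_C chips). beta is an assumed roll-off."""
    if n == 0:
        return 1.0
    t = n / N_C
    sinc = math.sin(math.pi * t) / (math.pi * t)
    denom = 1.0 - (2.0 * beta * t) ** 2
    if abs(denom) < 1e-9:                 # removable singularity at t = 1 / (2 * beta)
        return (math.pi / 4.0) * sinc
    return sinc * math.cos(math.pi * beta * t) / denom

def signature_waveform(code):
    """Equation (3): s[t] = sum_p g[t - p*N_C] * code[p], for t = -L_G/2+1 .. L_G/2+(N-1)*N_C."""
    t0 = -L_G // 2 + 1                    # first sample index of the pulse support
    s = [0j] * (L_G + (len(code) - 1) * N_C)
    for p, chip in enumerate(code):
        if chip == 0:                     # zero chips occur for spreading factors below 256
            continue
        for n in range(t0, L_G // 2 + 1):
            s[n + p * N_C - t0] += raised_cosine(n) * chip
    return s

# Toy +-1 +-j chip pattern (an assumption, not a real spreading code).
code = [complex(1 - 2 * (p % 2), 1 - 2 * ((p // 2) % 2)) for p in range(N)]
s = signature_waveform(code)
print(len(s), len(s[::N_C]))
```

Because the raised-cosine pulse is zero at every non-zero integer chip offset, the sample at t = 0 reduces to the first chip, which is why the matched-filter effect can be folded into g[t].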

The raised-cosine pulse vector g[t] is defined to be non-zero from t = -L_g/2 + 1 : L_g/2, with
g[0] = 1. With this definition the waveform s_k[t] is non-zero from t = -L_g/2 + 1 : L_g/2 + 255N_c.
By combining Equations (1) and (3) the calculation of the C-matrix elements can be
expressed directly in terms of the user code correlations. These correlations can be
calculated up front and stored in SDRAM. The C-matrix elements expressed in terms of
the code correlations Γ_lk[m] are

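    C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n-p)N_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_k[p] \cdot c_l^*[n]

                  = \frac{1}{2N_l} \sum_{n} \sum_{m} g[mN_c + \tau] \cdot c_k[n-m] \cdot c_l^*[n]

                  = \sum_{m} g[mN_c + \tau] \, \frac{1}{2N_l} \sum_{n} c_l^*[n] \, c_k[n-m]

                  \equiv \sum_{m} g[mN_c + \tau] \, \Gamma_{lk}[m]        (4)

with \tau \equiv m'T + \tau_{lq} - \tau_{kq'}.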



Since the pulse shape vector g[n] is of length L_g there are at most 2L_g/N_c = 24 real macs
to be performed to calculate each element C_lkqq'[m']. (The factor of 2 is because the code
correlations Γ_lk[m] are complex.) Given τ it is important to be able to efficiently calculate
the range of values m for which g[mN_c + τ] is non-zero. The minimum value of m is given
by m_min1 N_c + τ = -L_g/2 + 1. Now τ is given by τ = m'N N_c + τ_lq - τ_kq'. If each τ value is
decomposed as τ_lq = n_lq N_c + p_lq, then m_min1 = ceil[(-τ - L_g/2 + 1)/N_c] = -m'N - n_lq + n_kq' -
L_g/(2N_c) + ceil[(p_kq' - p_lq + 1)/N_c]. Now ceil[(p_kq' - p_lq + 1)/N_c] will be either 0 or 1. It is
convenient to set this to 0. In order that we do not access values outside the allocation for
g[n] we must set g[n] = 0.0 for n = -L_g/2 : -L_g/2 - (N_c - 1). Note that of the N_c^2 possible
values for ceil[(p_kq' - p_lq + 1)/N_c], all but one are 0. Hence we have

    m_{min1} = -m'N - n_{lq} + n_{kq'} - L_g/(2N_c)        (5)

Note that L_g must be divisible by 2N_c, and that L_g/(2N_c) should be a system constant.

The maximum value of m is given by m_max1 N_c + τ = L_g/2. This gives m_max1 = floor[(-τ +
L_g/2)/N_c] = -m'N - n_lq + n_kq' + L_g/(2N_c) + floor[(p_kq' - p_lq)/N_c]. Now floor[(p_kq' - p_lq)/N_c] will
be either -1 or 0. It is convenient to set this to 0. In order that we do not access values
outside the allocation for g[n] we must set g[n] = 0.0 for n = L_g/2 + 1 : L_g/2 + N_c. Note that
of the N_c possible values for floor[(p_kq' - p_lq)/N_c], about half are 0. Hence we have

    m_{max1} = -m'N - n_{lq} + n_{kq'} + L_g/(2N_c)        (6)

These values are quickly calculable.

The Γ-matrix is calculated in the next section for all values m by exploiting the FFT. Notice
that the calculation of the C-matrix elements requires only a small subset of the Γ-matrix
elements.
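The equivalence between the direct dot product of Equation (1) and the correlation form of Equation (4), summed only over the shift limits of Equations (5) and (6), can be checked numerically. The following is a minimal sketch under stated assumptions: the sizes are scaled down (N = 8, L_g = 16), the pulse is a toy triangle rather than a raised cosine, and all function names and the two codes are hypothetical.

```python
N, N_C, L_G = 8, 4, 16     # toy sizes; the text uses N = 256, N_c = 4, L_g = 48

def g(t):
    """Toy real pulse (triangle, g[0] = 1), zero outside t = -L_G/2+1 .. L_G/2."""
    if -L_G // 2 + 1 <= t <= L_G // 2:
        return 1.0 - abs(t) / (L_G / 2)
    return 0.0

def chip(code, n):
    """Code value at chip n, zero outside the code's support."""
    return code[n] if 0 <= n < len(code) else 0

def s(code, t):
    """Signature waveform, Equation (3)."""
    return sum(g(t - p * N_C) * code[p] for p in range(len(code)))

def gamma(c_l, c_k, m):
    """Code correlation Gamma_lk[m], the inner sum of Equation (4)."""
    return sum(c_l[n].conjugate() * chip(c_k, n - m) for n in range(len(c_l))) / (2 * len(c_l))

def c_direct(c_l, c_k, tau):
    """C element as a code/waveform dot product, Equation (1)."""
    return sum(s(c_k, n * N_C + tau) * c_l[n].conjugate() for n in range(len(c_l))) / (2 * len(c_l))

def c_from_gamma(c_l, c_k, m_sym, tau_lq, tau_kq):
    """C element from the correlations, Equation (4), using the limits (5) and (6)."""
    tau = m_sym * N * N_C + tau_lq - tau_kq
    base = -m_sym * N - tau_lq // N_C + tau_kq // N_C
    half = L_G // (2 * N_C)
    m_lo = max(base - half, -(len(c_k) - 1))    # clip to shifts where Gamma can be non-zero
    m_hi = min(base + half, len(c_l) - 1)
    return sum(g(m * N_C + tau) * gamma(c_l, c_k, m) for m in range(m_lo, m_hi + 1))

c_l = [complex(1 - 2 * (n % 2), 1 - 2 * ((n // 2) % 2)) for n in range(N)]
c_k = [complex(1 - 2 * ((n // 3) % 2), 1 - 2 * ((n * n) % 2)) for n in range(N)]
err = abs(c_direct(c_l, c_k, 5 - 2) - c_from_gamma(c_l, c_k, 0, 5, 2))
```

The limits of Equations (5) and (6) are slightly wider than the exact support of g, which is harmless here because `g` returns zero outside its support; in the text the same effect is obtained by zero-padding the stored pulse vector.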

3. Using the FFT to Calculate the Γ-matrix Elements

In the previous section it was shown that the Γ-matrix elements can be represented as a
convolution. This fact is here exploited to calculate the Γ-matrix elements using the FFT
convolution theorem. From Equation (4) the Γ-matrix elements are

    \Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n=0}^{N-1} c_l^*[n] \cdot c_k[n-m]        (7)

where N = 256. Three streams are related by this equation. In order to apply the
convolution theorem all three streams must be defined over the same time interval. The
code streams c_k[n] and c_l[n] are non-zero from n = 0:255. These intervals are based on
the maximum spreading factor. For higher data-rate users the intervals over which the
streams are non-zero are reduced further. We are concerned here, however, with the
intervals derived from the highest spreading factor since these will be the largest intervals
and we wish to define a common interval for all streams. The common interval allows the
FFTs to be reused for all user interactions.



Figure 1. Interval for FFT calculation of the Γ-matrix elements, shown
for the case where N_k = 256 and N_l = 128. (The figure marks the stream c_k[n - m]
against the points n = -256, n = 0 and n = 255.)

The range of values m for which Γ_lk[m] is non-zero can be derived from the above
intervals. The maximum value of m is limited by n - m >= 0, which gives

    255 - m_{max} = 0 \Rightarrow m_{max} = 255        (8)

and the minimum value is limited by n - m <= 255, which gives

    0 - m_{min} = 255 \Rightarrow m_{min} = -255        (9)

To achieve a common interval for all three streams we select the interval m = -M/2 : M/2 - 1,
M = 512. Where necessary the streams are zero-padded to fill up the interval.
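Now, the DFT and IDFT of the streams are

    \tilde{c}_k[r] = \sum_{n=-M/2}^{M/2-1} c_k[n] \, e^{-j 2\pi n r / M}

    c_k[n] = \frac{1}{M} \sum_{r=-M/2}^{M/2-1} \tilde{c}_k[r] \, e^{j 2\pi n r / M}        (10)

which gives

    \Gamma_{lk}[m] = \frac{1}{2N_l} \, \frac{1}{M} \sum_{r=-M/2}^{M/2-1} \tilde{c}_l^*[r] \, \tilde{c}_k[r] \, e^{-j 2\pi m r / M}        (11)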

Hence Γ_lk[m] can be calculated using the FFTs. Notice that the FFT gives values for all m.
From the analysis above we know that many of these values will be zero for high data rate
users. To conserve memory we wish to store only the non-zero values. The values of m
for which Γ_lk[m] is non-zero can be determined analytically. This subject is treated in the
next section where the storage and retrieval of the Γ-matrix elements is considered.
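The transform-domain route can be compared against the direct sum of Equation (7) on a small case. The sketch below is an illustration only: it uses a naive O(M^2) DFT in place of an FFT routine, an interval of M = 16 instead of 512, and two made-up 8-chip codes; the function names are hypothetical.

```python
import cmath

def dft(x):
    """Naive O(M^2) forward DFT; an FFT routine would replace this in practice."""
    M = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / M) for n in range(M))
            for r in range(M)]

def gamma_direct(c_l, c_k, m):
    """Equation (7): (1 / 2N_l) * sum_n conj(c_l[n]) * c_k[n - m]."""
    N_l = len(c_l)
    return sum(c_l[n].conjugate() * c_k[n - m]
               for n in range(N_l) if 0 <= n - m < len(c_k)) / (2 * N_l)

def gamma_via_dft(c_l, c_k, M):
    """All shifts m = -M/2 .. M/2-1 from one product of transforms."""
    N_l = len(c_l)
    C_l = dft(list(c_l) + [0] * (M - N_l))          # zero-pad to the common interval
    C_k = dft(list(c_k) + [0] * (M - len(c_k)))
    y = dft([a.conjugate() * b for a, b in zip(C_l, C_k)])
    return {m: y[m % M] / (M * 2 * N_l) for m in range(-M // 2, M // 2)}

c_l = [1+1j, -1+1j, 1-1j, -1-1j, 1+1j, 1+1j, -1-1j, 1-1j]
c_k = [1-1j, 1+1j, -1+1j, 1+1j, -1-1j, 1-1j, 1+1j, -1+1j]
G = gamma_via_dft(c_l, c_k, M=16)
```

Choosing M at least as large as N_l + N_k - 1 keeps the circular correlation produced by the transforms free of wrap-around, which is why two 256-chip streams fit a 512-point interval.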



4. Storage and Retrieval of Γ-matrix Elements

In order to efficiently store the Γ-matrix elements we must determine which values are
non-zero. For high data rate users certain elements c_l[n] are zero, even within the interval
n = 0:N - 1, N = 256. These zero values reduce the interval over which Γ_lk[m] is non-zero.
In order to determine the interval for non-zero values consider

    \Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n=0}^{N-1} c_l^*[n] \cdot c_k[n-m]        (12)

Define index j_l for the lth virtual user such that c_l[n] is non-zero only over the interval
n = j_l N_l : j_l N_l + N_l - 1. Correspondingly, the vector c_k[n] is non-zero only over the interval
n = j_k N_k : j_k N_k + N_k - 1. Given these definitions Γ_lk[m] can be rewritten as

    \Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n=0}^{N_l-1} c_l^*[n + j_l N_l] \cdot c_k[n + j_l N_l - m]        (13)

The minimum value of m for which Γ_lk[m] is non-zero is

    m_{min2} = -j_k N_k + j_l N_l - N_k + 1        (14)

and the maximum value of m for which Γ_lk[m] is non-zero is

    m_{max2} = N_l - 1 - j_k N_k + j_l N_l        (15)

The total number of non-zero elements is then

    m_{total} = m_{max2} - m_{min2} + 1 = N_l + N_k - 1        (16)

Table 1 below gives the number of bytes per l,k virtual-user pair based on 2 bytes per
element: one byte for the real part and one byte for the imaginary part.



Table 1. Number of bytes per l,k virtual user pair based on 2 bytes per element

             N_k = 256    128     64     32     16      8      4
N_l = 256         1022    766    638    574    542    526    518
      128          766    510    382    318    286    270    262
       64          638    382    254    190    158    142    134
       32          574    318    190    126     94     78     70
       16          542    286    158     94     62     46     38
        8          526    270    142     78     46     30     22
        4          518    262    134     70     38     22     14

Now we are in a position to determine the memory requirements for the Γ-matrix for a
given number of users at each spreading factor. Let there be K_q virtual users at spreading
factor N_q = 2^(8-q), q = 0:6, where K_q is the qth element of the vector K. Note that some
elements of K may be zero. Let Table 1 above be stored in matrix M with elements M_qq'.
For example, M_00 = 1022, and M_01 = 766. The total memory required by the Γ-matrix in
bytes is then

    M_{bytes} = \sum_{q=0}^{6} \Big[ \tfrac{1}{2} K_q (K_q + 1) M_{qq} + \sum_{q'=0}^{q-1} K_q K_{q'} M_{qq'} \Big]        (17)

For example, for 200 virtual users at spreading factor N_0 = 256 we have K_q = 200 δ_q0,
which gives M_bytes = ½K_0(K_0 + 1)M_00 = 100(201)(1022) = 20.5 MB.

For 10 384 Kbps users we have K_q = K_0 δ_q0 + K_6 δ_q6 with K_0 = 10 and K_6 = 640. This gives
M_bytes = ½K_0(K_0 + 1)M_00 + K_0 K_6 M_06 + ½K_6(K_6 + 1)M_66 = 5(11)(1022) + 10(640)(518) +
320(641)(14) = 6.2 MB.
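The two memory examples above can be reproduced with a few lines. This is a sketch of the sum over spreading-factor pairs, with the per-pair byte counts regenerated as 2(N_l + N_k - 1); the function name `gamma_bytes` is hypothetical.

```python
SF = [256, 128, 64, 32, 16, 8, 4]
M = [[2 * (n_l + n_k - 1) for n_k in SF] for n_l in SF]     # bytes per pair, as in Table 1

def gamma_bytes(K):
    """Total Gamma-matrix storage in bytes for K[q] virtual users at spreading factor SF[q]."""
    total = 0
    for q in range(len(K)):
        total += K[q] * (K[q] + 1) // 2 * M[q][q]           # pairs within one spreading factor
        for qp in range(q):
            total += K[q] * K[qp] * M[q][qp]                # pairs across spreading factors
    return total

voice_only = gamma_bytes([200, 0, 0, 0, 0, 0, 0])           # 200 virtual users at SF 256
ten_384k = gamma_bytes([10, 0, 0, 0, 0, 0, 640])            # 10 control + 640 data channels
print(voice_only / 1e6, ten_384k / 1e6)
```

The voice-only case is the larger of the two, which is why the 20.5 MB figure is used as the maximum size of the Γ-matrix later in the text.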

Now consider addressing, storing and accessing the Γ-matrix data. For each pair (l,k),
k >= l, we have 1 complex value Γ_lk[m] for each value of m, where m ranges from m_min2
to m_max2, and the total number of non-zero elements is m_total = m_max2 - m_min2 + 1. Hence for
each pair (l,k), k >= l, we have 2m_total time-contiguous bytes. To access the data, create an
array of structures:

struct {
    int m_min2;
    int m_max2;
    int m_total;
    char *Glk;
} G_info[N_VU_MAX][N_VU_MAX];

The C-matrix data is then retrieved using something like:

m_min2 = G_info[l][k].m_min2
m_max2 = G_info[l][k].m_max2
N_g = L_g/N_c
for m' = 0:1
    N1 = -m'*N - L_g/(2*N_c)                 // m'-dependent part of Equation (5)
    for q = 0:L-1
        for q' = 0:L-1
            τ = m'*T + τ_lq - τ_kq'
            m_min1 = N1 - n_lq + n_kq'        // Equation (5)
            m_max1 = m_min1 + N_g             // Equation (6)
            m_min = max[ m_min1, m_min2 ]
            m_max = min[ m_max1, m_max2 ]
            m_span = m_max - m_min + 1
            sum1 = 0.0;
            ptr1 = &G_info[l][k].Glk[m_min - m_min2]   // stored data starts at m_min2
            ptr2 = &g[m_min*N_c + τ]
            while m_span > 0
                sum1 += ( *ptr1++ )*( *ptr2 )
                ptr2 += N_c                   // g is sampled at N_c samples per chip
                m_span--
            end
            C[m'][l][k][q][q'] = sum1
        end
    end
end
5. Estimated Processing Times

The following processing times are estimated below:

• Calculate Γ-matrix elements

• Write Γ-matrix elements to SDRAM

• Pack Γ-matrix elements in SDRAM

• Extract Γ-matrix elements/Form C-matrix from SDRAM

• Write C-matrix elements to L2 cache

• Pack C-matrix elements in L2 cache

Processing times are calculated for two cases of interest. The first case is where K = 100
users (K_v = 200 virtual users) are accessing the system and a voice user is added to the
system. Not all of these users are active. The control channels are always active, but the
data channels have activity factor AF = 0.4. The mean number of active virtual users is
then K + AF*K = 140. The standard deviation is σ = sqrt(K*AF*(1 - AF)) = 4.90. With high
probability, then, we have K_v < 140 + 3σ < 155 active virtual users.

The second case is the worst case scenario. This occurs when a number of voice users
are accessing the system and a single 384 Kbps data user is added. A single 384 Kbps
data user adds interference equal to (0.25 + 0.125*100)/(0.25 + 0.400*1) ~= 20 voice users.
Hence, the number of voice users accessing the system must be reduced to
approximately K = 100 - 20 = 80 (K_v = 160). The 3σ number of active virtual users is then
80 + (0.125)80 + 3(3.0) = 99 active virtual users. The reason this scenario is stressful is
that when a single 384 Kbps data user is added to the system, J + 1 = 64 + 1 = 65 virtual
users are added to the system.



Calculate Γ-matrix elements

The Γ-matrix elements can be calculated in one of two ways. The first is using the SAL
zconvx to perform the direct convolution. The second is using the SAL fft_zipx to perform
the calculation via the FFT. The first method is preferable when the vector lengths are
small. SAL timings are given in Table 2. These timings are based on a 400 MHz PPC7400
with 160 MHz, 2 MB L2 cache. The data is assumed resident in L1 cache. The
performance loss for data resident in L2 cache is not severe.



Table 2. SAL timings and GFLOPS for zconvx function

Length      N    Timing (µs)    GFLOPS
  1024      4          19.33      1.70
  1024      8          29.73      2.20
  1024     16          50.55      2.59
  1024     32          92.32      2.84
  1024     64         176.53      2.97
  1024    128         346.80      3.47


The time to perform a 512 complex FFT, with in-place calculation (fft_zipx), on a 400 MHz
PPC7400 with 160 MHz, 2 MB L2 cache is 10.94 µs for data L1 resident. Prior to
performing the (final) FFT we must perform a complex vector multiply of length 512. The
SAL timings for zvmulx are given in Table 3.



Table 3. SAL timings and GFLOPS for zvmulx function

Length    Location    Timing (µs)    GFLOPS
  1024    L1                 4.46     1.38
  1024    L2                24.27     0.253
  1024    DRAM              61.49     0.100


We will also be interested in the time to move data. Hence the SAL timings for zvmovx are
given in Table 4.

Table 4. SAL timings for zvmovx function

Length    Location    Timing (µs)
  1024    L1                 1.20
  1024    L2                15.34
  1024    DRAM              30.05


Figure 2 shows the elements that must be calculated (in gray) when a physical user is
added to the system. When a physical user is added to the system there are 1 + J virtual
users added to the system: that is, 1 control channel + J = 256/SF data channels. The
number K_v represents the number of virtual users that are using the system to begin with.

Figure 2. Elements that must be calculated (in gray)
when a physical user is added to the system. (Rows l and columns k; the K_v existing
virtual users are followed by the 1 + J added virtual users.)

Hence there are (K_v + 1) elements added due to the control channel, and J(K_v + 1) + J(J +
1)/2 elements added due to the data channels. The total number of elements added is
then (J + 1)[K_v + 1 + J/2].

Suppose that the FFT is used to perform the calculations. The total number of FFTs to
perform is (J + 1) + (J + 1)[K_v + 1 + J/2]. The first term represents the FFTs to transform
c_l[n], and the second term represents the (J + 1)[K_v + 1 + J/2] inverse FFTs of
FFT{c_l[n]}*FFT{c_k[n]}. The time to perform the complex 512 FFTs is 10.94 µs, whereas
the time to perform the complex vector multiply and the complex 512 FFT is 24.27/2 +
10.94 = 23.08 µs.

For the first scenario there are K_v = 200 virtual users accessing the system and a voice
user is added to the system (J = 1). The total time to add the voice user is then (1 +
1)(10.94 µs) + (1 + 1)[200 + 1 + 1/2](23.08 µs) = 9.3 ms.

For the second scenario there are K_v = 160 virtual users accessing the system and a 384
Kbps data user is added to the system (J = 64). The total time to add the 384 Kbps user is
then (64 + 1)(10.94 µs) + (64 + 1)[160 + 1 + 64/2](23.08 µs) = 290 ms. This is far too
long, and hence for high data-rate users, at least, the Γ-matrix elements must be
calculated via convolutions.
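The two FFT-based totals can be reproduced directly from the element count (J + 1)[K_v + 1 + J/2] and the quoted SAL timings. The sketch below is only a restatement of that arithmetic; the function name is hypothetical.

```python
T_FFT = 10.94                    # us, 512-point complex FFT, data L1 resident (from the text)
T_MUL_FFT = 24.27 / 2 + 10.94    # us, length-512 complex multiply followed by one FFT

def fft_update_time_us(K_v, J):
    """Time to add one physical user (1 control + J data channels) using FFTs only."""
    new_pairs = (J + 1) * (K_v + 1 + J / 2)      # Gamma elements that must be calculated
    return (J + 1) * T_FFT + new_pairs * T_MUL_FFT

voice_ms = fft_update_time_us(200, 1) / 1000.0
data_ms = fft_update_time_us(160, 64) / 1000.0
print(round(voice_ms, 1), round(data_ms))
```

The cost per pair is fixed at about 23 µs regardless of spreading factor, so the 65 low-spreading-factor channels of a 384 Kbps user dominate the second total.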



The direct method to calculate the Γ-matrix elements is to use the SAL zconvx function to
perform the convolution

    \Gamma_{lk}[m] = \frac{1}{2N_l} \sum_{n=0}^{N_l-1} c_l^*[n + j_l N_l] \cdot c_k[n + j_l N_l - m]

                   = \frac{1}{2N_l} \sum_{n=0}^{N_k-1} c_l^*[n + j_k N_k + m] \cdot c_k[n + j_k N_k]        (18)

For each value of m there are N_min = min{N_l, N_k} complex macs (cmacs). Each cmac
requires 8 flops, and there are m_total = N_l + N_k - 1 m-values to calculate. Hence the total
number of flops is 8N_min(N_l + N_k - 1). For what follows we assume the convolution
calculation is performed at 1.50 GOPs = 1500 ops/µs. The calculation time to perform the
convolutions is presented in Table 5.



Table 5. Calculation time (µs) to perform the Γ-matrix convolutions.

             N_k = 256       128       64       32       16        8        4
N_l = 256       697.69    261.46   108.89    48.98    23.13    11.22     5.53
      128       261.46    174.08    65.19    27.14    12.20     5.76     2.79
       64       108.89     65.19    43.35    16.21     6.74     3.03     1.43
       32        48.98     27.14    16.21    10.75     4.01     1.66     0.75
       16        23.13     12.20     6.74     4.01     2.65     0.98     0.41
        8        11.22      5.76     3.03     1.66     0.98     0.64     0.23
        4         5.53      2.79     1.43     0.75     0.41     0.23     0.15
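The entries of Table 5 follow from the flop count 8 N_min (N_l + N_k - 1) at the assumed 1500 ops/µs. A short sketch that regenerates them, with an illustrative function name:

```python
SF = [256, 128, 64, 32, 16, 8, 4]

def conv_time_us(N_l, N_k, ops_per_us=1500.0):
    """Direct-convolution time for one (l, k) pair at the assumed 1.50 GOPs rate."""
    flops = 8 * min(N_l, N_k) * (N_l + N_k - 1)    # 8 flops per complex mac
    return flops / ops_per_us

T = [[round(conv_time_us(n_l, n_k), 2) for n_k in SF] for n_l in SF]
for row in T:
    print(" ".join(f"{v:8.2f}" for v in row))
```

Comparing each entry with the fixed 23.08 µs multiply-plus-FFT cost shows where direct convolution is the cheaper method.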



The shaded cells indicate times faster than the 23.08 µs FFT time. Equation 17 gives the
size of the Γ-matrix in bytes. Similarly, the total time to calculate the Γ-matrix is

    T_\Gamma(\mathbf{K}) = \sum_{q=0}^{6} \Big[ \tfrac{1}{2} K_q (K_q + 1) T_{qq} + \sum_{q'=0}^{q-1} K_q K_{q'} T_{qq'} \Big]        (19)

where T_qq' are the elements in Table 5. Now suppose K' = K + Δ, where Δ_q = J_x δ_qx + J_y δ_qy,
where x and y are not equal. Then

    \Delta T_\Gamma \equiv T_\Gamma(\mathbf{K}') - T_\Gamma(\mathbf{K})

                    = \tfrac{1}{2} J_x (J_x + 1) T_{xx} + \tfrac{1}{2} J_y (J_y + 1) T_{yy} + J_x J_y T_{xy} + \sum_{q} K_q \{ J_x T_{xq} + J_y T_{yq} \}        (20)

For the first scenario there are K_v = 200 virtual users accessing the system and a voice
user is added to the system (J = 1). Hence we have K_q = K_v δ_q0 (SF = 256), K_v = 200,
J_x = 2 (data plus control) and J_y = 0. The total time is then

½J_x(J_x + 1)T_00 + J_x K_v T_00 = (0.5)(2)(3)(0.70 ms) + (2)(200)(0.70 ms) ≈ 282 ms

This is far too long, and hence for voice users, at least, the Γ-matrix elements
must be calculated via FFTs.

For the second scenario there are K_v = 160 virtual users accessing the system and a 384
Kbps data user is added to the system (J = 64). Hence we have K_q = K_v δ_q0 (SF = 256), K_v
= 160, J_x = 1 (control) and J_y = J = 64 (data). The total time is then

(K_v + 1)T_00 + J(K_v + 1)T_06 + (J + 1)(J/2)T_66

= (161)(697.7 µs) + (64)(161)(5.53 µs) + (65)(32)(0.15 µs)

= 112.33 ms + 56.98 ms + 0.31 ms = 169.62 ms

Since T_00 = 697.7 µs is so large, these calculations should be performed using the FFT,
which costs 23.08 µs per convolution. We also have 1 FFT to compute (FFT{c_k*[n]}) for
the single control channel. This costs an additional 10.94 µs. The total time, then, to add
the 384 Kbps user is

10.94 µs + (161)(23.08) µs + (64)(161)(5.53) µs + (65)(32)(0.15) µs
= 61.02 ms
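The incremental cost of adding J_x users at spreading-factor index x and J_y users at index y can be evaluated against the rounded Table 5 entries, and cross-checked as the difference of two full totals. This is a sketch with hypothetical function names; it reproduces the 169.62 ms and 61.02 ms figures for the 384 Kbps scenario.

```python
SF = [256, 128, 64, 32, 16, 8, 4]
# Table 5 entries (us), rounded to two decimals as printed in the text.
T = [[round(8 * min(a, b) * (a + b - 1) / 1500.0, 2) for b in SF] for a in SF]

def total_time(K):
    """Time to calculate the whole Gamma-matrix for K[q] virtual users per spreading factor."""
    return sum(0.5 * K[q] * (K[q] + 1) * T[q][q] + sum(K[q] * K[p] * T[q][p] for p in range(q))
               for q in range(len(K)))

def delta_time(K, J_x, x, J_y, y):
    """Incremental time when J_x users at index x and J_y users at index y (x != y) are added."""
    d = 0.5 * J_x * (J_x + 1) * T[x][x] + 0.5 * J_y * (J_y + 1) * T[y][y] + J_x * J_y * T[x][y]
    return d + sum(K[q] * (J_x * T[x][q] + J_y * T[y][q]) for q in range(len(K)))

K = [160, 0, 0, 0, 0, 0, 0]
conv_only_us = delta_time(K, 1, 0, 64, 6)              # 1 control (SF 256) + 64 data (SF 4)

# Hybrid: FFT for the 161 SF-256 pairs, direct convolution for the rest.
hybrid_us = 10.94 + 161 * 23.08 + 64 * 161 * T[0][6] + 65 * 32 * T[6][6]
print(round(conv_only_us / 1000, 2), round(hybrid_us / 1000, 2))
```

Almost all of the remaining 61 ms in the hybrid case comes from the 64 x 161 control-to-data pairs, each of which is already cheaper by direct convolution than by FFT.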
Write Γ-matrix elements to SDRAM

The numbers in Table 1 represent the 2m_total bytes per Γ-matrix element. Recall that the
size of the Γ-matrix in bytes from Equation 17 is

    M_b(\mathbf{K}) = \sum_{q=0}^{6} \Big[ \tfrac{1}{2} K_q (K_q + 1) M_{qq} + \sum_{q'=0}^{q-1} K_q K_{q'} M_{qq'} \Big]        (21)

Now suppose K' = K + Δ, where Δ_q = J_x δ_qx + J_y δ_qy, where x and y are not equal. Then

    \Delta M_b \equiv M_b(\mathbf{K}') - M_b(\mathbf{K})        (22)

Consider the first scenario where K_q = 200 δ_q0 (SF = 256) and a single voice user is
added to the system: J_x = 2 (data plus control), and J_y = 0. The total number of bytes is
then 0.5(2)(3)(1022) + 200(2)(1022) = 0.412 MB. The SDRAM write speed is 133 MHz * 8
bytes * 0.5 = 532 MB/s. The time to write to SDRAM is then 0.774 ms.

Now for the second scenario K_q = 160 δ_q0 (SF = 256), and a single 384 Kbps (SF = 4)
user is added to the system: J_x = 1 (control) and J_y = 64 (data). The total number of bytes
is then 0.5(1)(2)(1022) + 0.5(64)(65)(14) + 160{1(1022) + 64(518)} = 5.498 MB. The
SDRAM write speed is 133 MHz * 8 bytes * 0.5 = 532 MB/s. The time to write to SDRAM is
then 10.33 ms.

Pack Γ-matrix elements in SDRAM

The maximum total size of the Γ-matrix is 20.5 MB. Suppose that in order to pack the
matrix every element must be moved. This is the worst case. The SDRAM speed is
133 MHz * 8 bytes * 0.5 = 532 MB/s. The move time is then 2(20.5 MB)/(532 MB/s) = 77.1
ms. If the Γ-matrix is divided over three processors this time is reduced by a factor of 3.
The packing can be done incrementally, so there is no strict time limit.
Extract Γ-matrix elements/Form C-matrix from SDRAM

Recall that the C-matrix data is retrieved using something like:

m_min2 = G_info[l][k].m_min2
m_max2 = G_info[l][k].m_max2
N_g = L_g/N_c
for m' = 0:1
    N1 = -m'*N - L_g/(2*N_c)                 // m'-dependent part of Equation (5)
    for q = 0:L-1
        for q' = 0:L-1
            τ = m'*T + τ_lq - τ_kq'
            m_min1 = N1 - n_lq + n_kq'        // Equation (5)
            m_max1 = m_min1 + N_g             // Equation (6)
            m_min = max[ m_min1, m_min2 ]
            m_max = min[ m_max1, m_max2 ]
            sum1 = 0.0;
            if m_max >= m_min
                m_span = m_max - m_min + 1
                ptr1 = &G_info[l][k].Glk[m_min - m_min2]   // stored data starts at m_min2
                ptr2 = &g[m_min*N_c + τ]
                while m_span > 0
                    sum1 += ( *ptr1++ )*( *ptr2 )
                    ptr2 += N_c               // g is sampled at N_c samples per chip
                    m_span--
                end
            end
            C[m'][l][k][q][q'] = sum1
        end
    end
end



Time to extract elements when a new user is added to the system

We calculated above the time to calculate the Γ-matrix elements when a new user is
added to the system. Here we consider the time to extract the corresponding C-matrix
elements.

Notice that Glk[m] are accessed from SDRAM. Values will almost certainly not be in either
L1 or L2 cache. For a given (l,k) pair, however, the spread in τ will for most cases be less
than 8 µs (i.e. for a 4 µs delay spread), which equates to (8 µs)(4 chips/µs)(2 bytes/chip) =
64 bytes, or 2 cache lines. Since data must be read in for two values of m' a total of 4
cache lines must be read. This will require 16 clocks, or about 16/133 = 0.12 µs.
However, measured results for zvmovx indicate that accesses to SDRAM are performed
at about 50% efficiency so that the required time is about 0.24 µs.

Now suppose, for example, user l = x is added to the system. We must fetch the elements
C[m'][x][k][q][q'] for all m', k, q and q'. As indicated above, all the m', q and q' values will
be contained typically in 4 cache lines. Hence if there are K_v virtual users we must read in
4K_v cache lines, or 32K_v clocks, where we have doubled the clocks to account for the 50%
efficiency. In general J + 1 virtual users are added to the system at a time. This will
require 32K_v(J + 1) clocks.

For the first case where we have 155 active virtual users and a new voice user is added
to the system, the time required to read in the C-matrix elements will be 32(155)(1 + 1)
clocks/(133 clocks/µs) = 74.6 µs. The industry standard hold time t_h for a voice call is 140
s. The average rate λ of users added to the system can be determined from λ t_h = K,
where K is the average number of users using the system. For K = 100 users we have λ =
100/140 s = 1 user added per 1.4 s.

For the case where we have 99 active virtual users and a 384 Kbps user is added to the
system, the time required to read in the C-matrix elements will be 32(99)(64 + 1)
clocks/(133 clocks/µs) = 1.55 ms. However data users presumably will be added to the
system more infrequently than voice users.



Time to extract elements when τ_lq changes

Now suppose, for example, user l = x lag q = y changes. Then we must fetch the elements
C[m'][x][k][y][q'] for all m', k and q'. All the q' values will be contained typically in 1 cache
line. Hence we must read in 2(K_v)(1) = 2K_v cache lines, or 16K_v clocks, where we have
doubled the clocks to account for the 50% efficiency. In general, when a lag changes
there are J + 1 virtual users for which the C-matrix elements must be updated. This will
require 16K_v(J + 1) clocks.

For the first case where we have 155 active virtual users and a voice user's profile (one
lag) changes, the time required to read in the C-matrix elements will be 16(155)(1 + 1)
clocks/(133 clocks/µs) = 37.3 µs. Recall that for high mobility users such changes should
occur at a rate of about 1 per 100 ms per physical user. This equates to about once per
1.33 ms processing interval if there are 100 physical users so that approximately 37.3 µs
will be required every 1.33 ms.

For the case where we have 99 virtual users and a 384 Kbps data user's profile (one lag)
changes, the time required to read in the C-matrix elements will be 16(99)(64 + 1)
clocks/(133 clocks/µs) = 0.774 ms. However data users will have lower mobility and hence
such changes should occur infrequently.
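The four extraction times above share one pattern: cache lines per (l,k) pair, times active virtual users, times the J + 1 virtual users affected, at 4 clocks per line doubled for the 50% SDRAM efficiency. A small sketch of that arithmetic, with an illustrative function name:

```python
SDRAM_CLOCKS_PER_US = 133          # 133 MHz SDRAM clock (from the text)

def extract_time_us(K_active, J, lines_per_pair):
    """SDRAM read time for the C-matrix update of one physical user (J + 1 virtual users)."""
    clocks_per_line = 4 * 2        # 4 clocks per cache line, doubled for 50% efficiency
    clocks = clocks_per_line * lines_per_pair * K_active * (J + 1)
    return clocks / SDRAM_CLOCKS_PER_US

new_voice = extract_time_us(155, 1, lines_per_pair=4)     # new user: 4 lines per pair
new_data = extract_time_us(99, 64, lines_per_pair=4)
lag_voice = extract_time_us(155, 1, lines_per_pair=2)     # one lag changes: 2 lines per pair
lag_data = extract_time_us(99, 64, lines_per_pair=2)
print(new_voice, new_data, lag_voice, lag_data)
```

A lag change costs half as much as adding a user because only one of the L values of q is touched, so the q' values for each m' fit in a single cache line.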



Write C-matrix elements to L2 cache

Time to write elements when a new user is added to the system

Consider again the case where user l = x is added to the system. We must write elements
C[m'][x][k][q][q'] for all m', k, q and q'. If there are K_v active virtual users we must write
4K_v L^2 bytes, where we have doubled the bytes since the elements are complex. In general
J + 1 virtual users are added to the system at a time. This will require 4K_v L^2 (J + 1) bytes to
be written to L2 cache.

For the first case where we have 155 active virtual users and a new voice user is added
to the system, the time required to write the C-matrix elements will be 4(155)(16)(1 + 1)
bytes/(2128 bytes/µs) = 9.3 µs.

For the second case where we have 99 active virtual users and a 384 Kbps user is added
to the system, the time required to write the C-matrix elements will be 4(99)(16)(64 + 1)
bytes/(2128 bytes/µs) = 193.5 µs. Recall, however, that data users presumably will be
added to the system more infrequently than voice users.



Time to write elements when τ_lq changes

Now suppose, for example, user l = x lag q = y changes. We must write elements
C[m'][x][k][y][q'] for all m', k and q'. If there are K_v active virtual users we must write 4K_v L
bytes, where we have doubled the bytes since the elements are complex. In general, when
a lag changes there are J + 1 virtual users for which the C-matrix elements must be
updated. This will require 4K_v L(J + 1) bytes to be written to L2 cache.

For the first case where we have 155 active virtual users and a voice user's profile (one
lag) changes, the time required to write the C-matrix elements will be 4(155)(4)(1 + 1)
bytes/(2128 bytes/µs) = 2.33 µs.

For the second case where we have 99 active virtual users and a 384 Kbps data user's
profile (one lag) changes, the time required to write the C-matrix elements will be
4(99)(4)(64 + 1) bytes/(2128 bytes/µs) = 48.4 µs. However data users will have lower
mobility and hence such changes should occur infrequently.



Pack C-matrix elements in L2 cache 

The C-matrix elements will need to be packed in memory every time a new user is added 
to or deleted from the system and every time a new user becomes active or inactive. The 
size of the C-matrix is 2(3/2)(K_v L)^2 = 3(K_v L)^2 bytes; divided over three
processors, however, this becomes (K_v L)^2 bytes per processor. Assume that the entire matrix must
be moved. The move is within L2 cache. Hence the total move time is 2(K_v L)^2 bytes/(2128
bytes/µs), where the factor of 2 accounts for read and write.

For the first case where we have 155 active virtual users the time required to move the
C-matrix elements will be 2(155*4)^2 bytes/(2128 bytes/µs) = 0.361 ms.

For the second case where we have 99 active virtual users the time required to move the
C-matrix elements will be 2(99*4)^2 bytes/(2128 bytes/µs) = 0.147 ms.

These events will occur typically once every 10 ms, that is, once per frame. 
6. Summary and Conclusions

In summary, we have determined

• The Γ-matrix will require approximately 20.5 MB of SDRAM

• To efficiently calculate the Γ-matrix elements will require both direct convolution and
FFT calculations

• To pack the Γ-matrix in SDRAM will require approximately 77.1 ms

The following processing times are estimated:

Estimated Processing Times               Case 1                Case 2
                                         (voice user added)    (384 Kbps user added)
Calculate Γ-matrix elements              9.3 ms                61.0 ms
Write Γ-matrix elements to SDRAM         0.77 ms               10.3 ms
Extract C-matrix elements when
    New user added                       75 µs                 1.6 ms
    Multipath profile changes            37 µs                 0.77 ms
Write C-matrix elements to L2 when
    New user added                       9.3 µs                194 µs
    Multipath profile changes            2.3 µs                48 µs
Pack C-matrix elements in L2 cache       361 µs                147 µs

These times are based on a single, dedicated G4 allocated to perform the calculations.
The C-matrix elements can be represented in terms of the underlying code
correlations using

    C_{lkqq'}[m'] = \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n-p)N_c + m'T + \tau_{lq} - \tau_{kq'}] \cdot c_k[p] \cdot c_l^*[n]

                  = \sum_{m} g[mN_c + \tau] \cdot \frac{1}{2N_l} \sum_{n} c_l^*[n] \cdot c_k[n-m]        (1)

The Γ-matrix represents the correlation between the complex user codes. The
complex code for user l is assumed to be infinite in length, but with only N_l
non-zero values. The non-zero values are constrained to be ±1 ± j. The Γ-matrix
represented in terms of the real and imaginary parts of the complex user codes
becomes

    \Gamma_{lk}[m] \equiv \frac{1}{2N_l} \sum_{n} c_l^*[n] \cdot c_k[n-m]

                   = \frac{1}{2N_l} \sum_{n} \{ c_l^R[n] \cdot c_k^R[n-m] + c_l^I[n] \cdot c_k^I[n-m]
                                              + j c_l^R[n] \cdot c_k^I[n-m] - j c_l^I[n] \cdot c_k^R[n-m] \}

                   = \Gamma_{lk}^{RR}[m] + \Gamma_{lk}^{II}[m] + j \{ \Gamma_{lk}^{RI}[m] - \Gamma_{lk}^{IR}[m] \}        (2)

where

    \Gamma_{lk}^{XY}[m] \equiv \frac{1}{2N_l} \sum_{n} c_l^X[n] \, c_k^Y[n-m], \quad X, Y \in \{R, I\}        (3)

Consider any one of the above real correlations, denoted

    \Gamma_{lk}^{XY}[m] \equiv \frac{1}{2N_l} \sum_{n} c_l^X[n] \, c_k^Y[n-m]        (4)

where X and Y can be either R or I. Since the elements of the codes are now
constrained to be ±1 or 0, we can define

    c_l^X[n] = (1 - 2\gamma_l^X[n]) \, m_l^X[n]        (5)

where γ_l^X[n] and m_l^X[n] are both either zero or one. The sequence m_l^X[n] is a
mask used to account for values of c_l^X[n] that are zero. With these definitions
Equation (4) becomes

    \Gamma_{lk}^{XY}[m] = \frac{1}{2N_l} \sum_{n} (1 - 2\gamma_l^X[n]) \cdot m_l^X[n] \cdot (1 - 2\gamma_k^Y[n-m]) \cdot m_k^Y[n-m]

                        = \frac{1}{2N_l} \sum_{n} (1 - 2\gamma_l^X[n]) \cdot (1 - 2\gamma_k^Y[n-m]) \cdot m_l^X[n] \, m_k^Y[n-m]

                        = \frac{1}{2N_l} \sum_{n} \{ 1 - 2(\gamma_l^X[n] \oplus \gamma_k^Y[n-m]) \} \, m_l^X[n] \cdot m_k^Y[n-m]

                        = \frac{1}{2N_l} \{ M_{lk}^{XY}[m] - 2 N_{lk}^{XY}[m] \}        (6)

    M_{lk}^{XY}[m] \equiv \sum_{n} m_l^X[n] \cdot m_k^Y[n-m]

    N_{lk}^{XY}[m] \equiv \sum_{n} (\gamma_l^X[n] \oplus \gamma_k^Y[n-m]) \cdot m_l^X[n] \cdot m_k^Y[n-m]

where ⊕ indicates modulo-2 addition (or logical XOR).
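The mask-and-XOR form of Equation (6) can be exercised in software by packing the sign bits and masks into integers and counting set bits. The sketch below is an illustration of the identity only, not a model of the hardware described next; the 16-chip sequences are made up and the function names are hypothetical.

```python
def pack(bits):
    """Pack a 0/1 sequence into an integer, bit n holding element n."""
    return sum(b << n for n, b in enumerate(bits))

def shift(reg, m):
    """Align sequence k so that bit n holds element n - m."""
    return reg << m if m >= 0 else reg >> -m

def gamma_xy_bits(g_l, m_l, g_k, m_k, m, N_l):
    """Equation (6): (M[m] - 2 N[m]) / (2 N_l) using AND, XOR and bit counts."""
    both = pack(m_l) & shift(pack(m_k), m)                  # chips where both masks are set
    differ = (pack(g_l) ^ shift(pack(g_k), m)) & both       # of those, chips with opposite sign
    M_val = bin(both).count("1")
    N_val = bin(differ).count("1")
    return (M_val - 2 * N_val) / (2 * N_l)

def gamma_xy_direct(g_l, m_l, g_k, m_k, m, N_l):
    """Equation (4) with the +-1/0 chip values rebuilt from Equation (5)."""
    def c(g, mask, n):
        return (1 - 2 * g[n]) * mask[n] if 0 <= n < len(g) else 0
    return sum(c(g_l, m_l, n) * c(g_k, m_k, n - m) for n in range(len(g_l))) / (2 * N_l)

# Made-up 16-chip example: user l has 8 non-zero chips, user k has 16.
g_l = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
m_l = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
g_k = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
m_k = [1] * 16
N_l = sum(m_l)
values = [gamma_xy_bits(g_l, m_l, g_k, m_k, m, N_l) for m in range(-15, 16)]
```

Each shift needs only two AND operations, one XOR and two bit counts, which is what makes a shift-register implementation attractive.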

The hardware to perform these operations is shown in Figures 1 - 3. Figure 1 
shows the initial register configuration after loading code and mask sequences. 
The boolean functions are shown in Figure 2, and Figure 3 shows the register 
configuration after a number of shifts. 



Figure 1. Initial register configuration after loading code and mask sequences.
(The mask and code for user k and the mask and code for user l, 256 chips each, are
loaded into registers that feed the Boolean operations and a sum, which produces the output.)

Figure 2. Boolean functions.

Figure 3. Register configuration after a number of shifts. (The mask and code for user k
are shifted right 1 chip at a time while zeros are loaded in from the left. A total of 512
shifts is performed, shifting mask k and code k out of the registers at the right.)



The above hardware calculates the functions M_lk^XY[m] and N_lk^XY[m]. The
remaining calculations to form Γ_lk^XY[m] and subsequently Γ_lk[m] can be
performed in software. Note that the four functions Γ_lk^XY[m] corresponding to X, Y =
R, I, which are components of Γ_lk[m], can be calculated in parallel. For K_v = 200
virtual users, and assuming that 10% of all (l, k) pairs must be calculated in 2 ms,
then for real-time operation we must calculate 0.10(200)^2 = 4000 Γ_lk[m] elements
(all shifts) in 2 ms, or about 2M elements (all shifts) per second. For K_v = 128
virtual users the requirement drops to 0.8192M elements (all shifts) per second.

In what has been presented the Γ_lk[m] elements are calculated for all 512 shifts.
Not all of these shifts are needed, so it is possible to reduce the number of
calculations per Γ_lk[m] element. The cost is increased design complexity.
#
.SUFFIXES: .a .c .mac .o .S

ARCH = ppc7400
MUDLIB = mudlib.a

###CFLAGS = -Ot -t ${ARCH} -I. -DCOMPILE_C
CFLAGS = -Ot -t ${ARCH} -I.
ASFLAGS = -t ${ARCH} -DBUILD_MAX -I.

#
# Make object files
#
.c.o:
	ccmc ${CFLAGS} -o $*.o -c $*.c

#
# Make ASM
#
.mac.o:
	rm -f $*.S
	cp $*.mac $*.S
	ccmc ${ASFLAGS} -o $*.o -c $*.S
	rm -f $*.S

OBJS = \
	get_sizes.o \
	get_sizes_v.o \
	reformat_corr.o \
	rmats.o \
	reformat_r.o \
	mpic.o \
	gen_x_row.o \
	gen_r_sums.o \
	gen_r_sums2.o \
	gen_r_matrices.o \
	mtrans32_8bit.o \
	mtriangle_8bit.o \
	dotpr3_8bit.o \
	dotpr6_8bit.o \
	dotpr9_8bit.o \
	sve3_8bit.o \
	fixed_cdotpr.o \
	zdotpr4_vmx.o \
	zdotpr_vmx.o

${MUDLIB}: Makefile ${OBJS}
	armc -c $@ ${OBJS}

#
# Cleanup
#
clean:
	rm -f ${OBJS} *.S ${MUDLIB}

get_sizes.o: mudlib.h get_sizes.c
reformat_corr.o: mudlib.h reformat_corr.c
rmats.o: mudlib.h rmats.c \
	gen_x_row.mac gen_r_sums.mac gen_r_sums2.mac \
	gen_r_matrices.mac
reformat_r.o: mudlib.h reformat_r.c
mpic.o: mudlib.h mpic.c \
	dotpr3_8bit.mac dotpr6_8bit.mac dotpr9_8bit.mac \
	sve3_8bit.mac
dotpr3_8bit.o: dotpr3_8bit.mac salppc.inc
dotpr6_8bit.o: dotpr6_8bit.mac salppc.inc
dotpr9_8bit.o: dotpr9_8bit.mac salppc.inc
sve3_8bit.o: sve3_8bit.mac salppc.inc
fixed_cdotpr.o: zdotpr4_vmx.mac salppc.inc
zdotpr4_vmx.o: zdotpr4_vmx.mac zdotpr4_vmx.k salppc.inc
#include "mudlib.h"

#define DO_CALC_STATS 0
#define DO_TRUNCATE 1
#define DO_SATURATE 1
#define DO_SQUELCH 0

#define SQUELCH_THRESH 1.0
#define TRUNCATE_BIAS 0.0

#if DO_TRUNCATE
#define SATURATE_THRESH (128.0 + TRUNCATE_BIAS)
#else
#define SATURATE_THRESH 127.5
#endif

#define SATURATE( f ) \
{ \
    if ( (f) >= SATURATE_THRESH ) f = (SATURATE_THRESH - 1.0); \
    else if ( (f) < -SATURATE_THRESH ) f = -SATURATE_THRESH; \
}

#if DO_TRUNCATE
#if 0
#define BF8_FIX( f ) ((BF8)((FABS(f) <= TRUNCATE_BIAS) ? 0 : \
                     (((f) > 0.0) ? ((f) - TRUNCATE_BIAS) : \
                                    ((f) + TRUNCATE_BIAS))))
#define BF8_FIX( f ) ((BF8)(f))
#else
/* the comparison operator is illegible in the source; != is assumed */
#define BF8_FIX( f ) ((BF8)((((f) < 0.0) && ((f) != (float)((int)(f)))) ? \
                     ((f) + 1.0) : (f)))
#endif
#else
#define BF8_FIX( f ) ((BF8)(((f) >= 0.0) ? ((f) + 0.5) : ((f) - 0.5)))
#endif

#define UPDATE_MAX( f, max ) \
    if ( FABS( f ) > max ) max = FABS( f );

#define uchar unsigned char
#define ushort unsigned short
#define ulong unsigned long

#if DO_CALC_STATS
static float max_R_value;
#endif

void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
);

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
);

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
);

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
);



void mudlib_gen_R (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF8 *corr_0_bf,     /* adjusted for starting physical user */
    COMPLEX_BF8 *corr_1_bf,     /* adjusted for starting physical user */
    uchar *ptov_map,            /* no more than 256 virts. per phys */
    float *bf_scalep,           /* scalar: always a power of 2 */
    float *inv_scalep,          /* start at 0'th physical user */
    float *scalep,              /* start at 0'th physical user */
    char *L1_cachep,            /* temp: 32K bytes, 32-byte aligned */
    BF8 *R0_upper_bf,
    BF8 *R0_lower_bf,
    BF8 *R1_trans_bf,
    BF8 *R1m_bf,
    int tot_phys_users,
    int tot_virt_users,
    int start_phys_user,        /* zero-based starting row (inclusive) */
    int start_virt_user,        /* relative to start_phys_user */
    int end_phys_user,          /* zero-based ending row (inclusive) */
    int end_virt_user           /* relative to end_phys_user */
)
{


    COMPLEX_BF16 *X_bf;
    BF32 *R_sums0, *R_sums1;
    uchar *R0_ptov_map;
    int bump, byte_offset, i, iv, last_virt_user;
    int R0_align, R0_skipped_virt_users, R0_tcols, R0_virt_users, R1_tcols;

#if DO_CALC_STATS
    max_R_value = 0.0;
#endif

    /* carve X_bf, R_sums0, R_sums1 and R0_ptov_map out of the L1 scratch area */
    X_bf = (COMPLEX_BF16 *)L1_cachep;

    byte_offset = tot_phys_users * NUM_FINGERS_SQUARED * sizeof(COMPLEX_BF16);
    R_sums0 = (BF32 *)(((ulong)X_bf + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);

    byte_offset = tot_virt_users * sizeof(BF32);
    R_sums1 = (BF32 *)(((ulong)R_sums0 + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);

    R0_ptov_map = (uchar *)(((ulong)R_sums1 + byte_offset +
                             R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK);

    R1_tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    R0_virt_users = 0;

    for ( i = start_phys_user; i < tot_phys_users; i++ ) {
        R0_virt_users += (int)ptov_map[i];
        R0_ptov_map[i] = ptov_map[i];
    }

    R0_ptov_map[start_phys_user] -= start_virt_user;

    R0_skipped_virt_users = tot_virt_users - R0_virt_users + start_virt_user;
    R0_virt_users -= (start_virt_user + 1);

    --inv_scalep;    /* predecrement to allow for common indexing */

    for ( i = start_phys_user; i <= end_phys_user; i++ ) {

        gen_X_row(
            mpath1_bf,
            mpath2_bf,
            X_bf,
            i,
            tot_phys_users
        );

        --R0_ptov_map[i];    /* excludes R0 diagonal */

        last_virt_user = (i < end_phys_user) ? ((int)ptov_map[i] - 1) :
                                               end_virt_user;

        /* process virtual users two rows at a time */
        for ( iv = start_virt_user; (iv + 1) <= last_virt_user; iv += 2 ) {

            gen_R_sums2(
                X_bf + (i * NUM_FINGERS_SQUARED),
                corr_0_bf,
                corr_0_bf + ((R0_virt_users - 1) * NUM_FINGERS_SQUARED),
                R0_ptov_map + i,
                R_sums0 + (R0_skipped_virt_users + 1),
                R_sums1 + (R0_skipped_virt_users + 1),
                tot_phys_users - i
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices(
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );

            R0_upper_bf[R0_align - 1] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            R0_tcols = R1_tcols - ((R0_skipped_virt_users + 1) &
                                   ~R_MATRIX_ALIGN_MASK);
            R0_align = ((R0_skipped_virt_users + 1) & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices(
                R_sums1 + (R0_skipped_virt_users + 2),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep + (R0_skipped_virt_users + 2),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users - 1
            );

            R0_upper_bf[R0_align - 1] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums2(
                X_bf,
                corr_1_bf,
                corr_1_bf + (tot_virt_users * NUM_FINGERS_SQUARED),
                ptov_map,
                R_sums0,
                R_sums1,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices(
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            gen_R_matrices(
                R_sums1,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (((2 * R0_virt_users) - 1) * NUM_FINGERS_SQUARED);
            corr_1_bf += ((2 * tot_virt_users) * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 2;
            R0_virt_users -= 2;
            R0_skipped_virt_users += 2;
        }

        /* one leftover row when the number of rows is odd */
        if ( iv <= last_virt_user ) {

            bump = R0_ptov_map[i] ? 0 : 1;

            gen_R_sums(
                X_bf + ((i + bump) * NUM_FINGERS_SQUARED),
                corr_0_bf,
                R0_ptov_map + i + bump,
                R_sums0 + (R0_skipped_virt_users + 1),
                tot_phys_users - i - bump
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices(
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );

            R0_upper_bf[R0_align - 1] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums(
                X_bf,
                corr_1_bf,
                ptov_map,
                R_sums0,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices(
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            R1_trans_bf += R1_tcols;
            R1m_bf += R1_tcols;

            corr_0_bf += (R0_virt_users * NUM_FINGERS_SQUARED);
            corr_1_bf += (tot_virt_users * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 1;
            R0_virt_users -= 1;
            R0_skipped_virt_users += 1;
        }

        start_virt_user = 0;    /* for all subsequent passes */
    }

#if DO_CALC_STATS
    printf( "max_R_value = %f\n", max_R_value );
    if ( max_R_value > 127.0 )
        printf( "***** OVERFLOW *****\n" );
#endif
}

#if COMPILE_C
void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
)
{
    COMPLEX_BF16 *in_mpath1p, *in_mpath2p;
    COMPLEX_BF16 *out_mpath1p, *out_mpath2p;
    int i, j, g, g1;
    BF32 s1r, s1i, s2r, s2i;
    BF32 a1r, a1i, a2r, a2i;
    BF32 cr, ci;

    out_mpath1p = mpath1_bf + (phys_index * NUM_FINGERS);
    out_mpath2p = mpath2_bf + (phys_index * NUM_FINGERS);

    for ( i = 0; i < tot_phys_users; i++ ) {
        in_mpath1p = mpath1_bf + (i * NUM_FINGERS);    /* 4 complex values */
        in_mpath2p = mpath2_bf + (i * NUM_FINGERS);    /* 4 complex values */

        j = 0;
        for ( g1 = 0; g1 < NUM_FINGERS; g1++ ) {
            s1r = (BF32)out_mpath1p[g1].real;
            s1i = (BF32)out_mpath1p[g1].imag;
            s2r = (BF32)out_mpath2p[g1].real;
            s2i = (BF32)out_mpath2p[g1].imag;

            for ( g = 0; g < NUM_FINGERS; g++ ) {
                a1r = (BF32)in_mpath1p[g].real;
                a1i = (BF32)in_mpath1p[g].imag;
                a2r = (BF32)in_mpath2p[g].real;
                a2i = (BF32)in_mpath2p[g].imag;

                /* conj(a) * s, summed over the two multipath sets */
                cr = (a1r * s1r) + (a1i * s1i);
                ci = (a1r * s1i) - (a1i * s1r);
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);

                X_bf[i * NUM_FINGERS_SQUARED + j].real = (BF16)(cr >> 16);
                X_bf[i * NUM_FINGERS_SQUARED + j].imag = (BF16)(ci >> 16);
                ++j;    /* increment not legible in the source; assumed */
            }
        }
    }
}

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
)
{
    int i, j, k;
    BF32 sum;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum += (BF32)X_bf[k].real * (BF32)corr_bf->real;
                sum += (BF32)X_bf[k].imag * (BF32)corr_bf->imag;
                ++corr_bf;
            }
            *R_sums++ = sum;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
)
{
    int i, j, k;
    BF32 suma, sumb;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            suma = 0;
            sumb = 0;
            for ( k = 0; k < 16; k++ ) {
                suma += (BF32)X_bf[k].real * (BF32)corra_bf->real;
                suma += (BF32)X_bf[k].imag * (BF32)corra_bf->imag;
                sumb += (BF32)X_bf[k].real * (BF32)corrb_bf->real;
                sumb += (BF32)X_bf[k].imag * (BF32)corrb_bf->imag;
                ++corra_bf;
                ++corrb_bf;
            }
            *R_sumsa++ = suma;
            *R_sumsb++ = sumb;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
)
{
    int i;
    float bf_scale, fsum, fsum_scale, inv_scale, scale;

    bf_scale = *bf_scalep;
    inv_scale = *inv_scalep;

    for ( i = 0; i < num_virt_users; i++ ) {
        scale = scalep[i];
        fsum = (float)(R_sums[i]);
        fsum *= bf_scale;
        fsum_scale = fsum * inv_scale;
        fsum_scale *= scale;

#if DO_CALC_STATS
        UPDATE_MAX( fsum_scale, max_R_value )
        UPDATE_MAX( fsum, max_R_value )
#endif

#if DO_SQUELCH
        if ( FABS( fsum_scale ) <= SQUELCH_THRESH ) fsum_scale = 0.0;
        if ( FABS( fsum ) <= SQUELCH_THRESH ) fsum = 0.0;
#endif

#if DO_SATURATE
        SATURATE( fsum_scale )
        SATURATE( fsum )
#endif

        no_scale_row_bf[i] = BF8_FIX( fsum );
        scale_row_bf[i] = BF8_FIX( fsum_scale );
    }
}


#endif /* COMPILE_C */ 
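
The scaling and 8-bit quantization performed per element in gen_R_matrices can be illustrated with the standalone sketch below. It is a simplified model and not the listing's code: the names `quantize8` and `scale_and_quantize` are hypothetical, the threshold assumes DO_TRUNCATE = 1 with a zero TRUNCATE_BIAS, and plain truncation toward zero is used in place of the BF8_FIX variant selected in the listing.

```c
#include <stdint.h>

/* Assumed: SATURATE_THRESH with DO_TRUNCATE = 1 and TRUNCATE_BIAS = 0.0 */
#define SAT_THRESH 128.0f

/* Clamp to [-128, 127], then convert to 8 bits.  The C float-to-integer
 * conversion truncates toward zero. */
int8_t quantize8(float f)
{
    if (f >= SAT_THRESH)
        f = SAT_THRESH - 1.0f;
    else if (f < -SAT_THRESH)
        f = -SAT_THRESH;
    return (int8_t)f;
}

/* For each 32-bit sum, produce an unscaled and a scaled 8-bit entry,
 * following the order of operations in gen_R_matrices. */
void scale_and_quantize(const int32_t *sums, const float *scale,
                        float bf_scale, float inv_scale,
                        int8_t *plain, int8_t *scaled, int n)
{
    for (int i = 0; i < n; i++) {
        float fsum = (float)sums[i] * bf_scale;
        plain[i]  = quantize8(fsum);
        scaled[i] = quantize8(fsum * inv_scale * scale[i]);
    }
}
```

Saturating before the conversion matters because converting an out-of-range float to an 8-bit integer is not well defined in C; clamping first keeps every result inside the representable range.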
/*
    MC Standard Algorithms    PPC Macro language Version

    File Name:      dotpr3_8bit.mac
    Description:    Source code for routine which computes three
                    dot products, combining the three sums prior
                    to exit.

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision    Date      Engineer    Reason
    0.0         000510    fpl         Created
    0.1         000521    fpl         Added num_cached_rows
    0.2         000521    fpl         Changed to fixed point
    0.3         000605    fpl         Changed to .k file
    0.4         000926    jg          Back to .mac and no dsts
*/



#include "salppc.inc"

#define LVX_BT( vT, rA, rB )      LVX( vT, rA, rB )
#define FUNC_ENTRY                dotpr3_8bit
#define VMSUM( vT, vA, vB, vC )   VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT          6
#define HALF_BLOCK_BIT            0x20
#define QUARTER_BLOCK_BIT         0x10
#define LOOP_BLOCK_SIZE           64

/**
    Input parameters
**/
#define bt1mptr   r3
#define r1ptr     r4
#define r0ptr     r5
#define r1mptr    r6
#define C         r7
#define N         r8
#define hat_tc    r9

/**
    Local loop registers
**/
#define bt0ptr    r10
#define bt1ptr    r11
#define index1    r12
#define index2    r13
#define index3    r0
#define icount    hat_tc

/**
    G4 registers
**/
#define rq10      v0
#define rq11      v1
#define rq12      v2
#define rq13      v3
#define zero      v3

#define rq00      v4
#define rq01      v5
#define rq02      v6
#define rq03      v7

#define rq1m0     v8
#define rq1m1     v9
#define rq1m2     v10
#define rq1m3     v11

#define bt1m0     v12
#define bt1m1     v13
#define bt1m2     v14
#define bt1m3     v15

#define bt10      v16
#define bt11      v17
#define bt12      v18
#define bt13      v19

#define bt00      v20
#define bt01      v21
#define bt02      v22
#define bt03      v23

#define sum0      v24
#define sum1      v25
#define sum2      v26
#define sum3      v27

/**
    Begin code text

    Setup loop registers, test for zero N
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13
USE_THRU_v27( VRSAVE_COND )

/**
    Load up local loop registers
**/
ADD( bt0ptr, bt1mptr, hat_tc )
VXOR( sum0, sum0, sum0 )
ADD( bt1ptr, bt0ptr, hat_tc )
LI( index1, 16 )
VXOR( sum1, sum1, sum1 )
LI( index2, 32 )
VXOR( sum2, sum2, sum2 )
LI( index3, 48 )
VXOR( sum3, sum3, sum3 )
SRWI_C( icount, N, LOOP_COUNT_SHIFT )    /* 32 sum updates per loop trip */
BEQ( do_half_block )

/**
    Loop entry code
**/
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
DECR_C( icount )
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI( r1ptr, r1ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
BR( mid_loop )

/**
    Loop computes three dot products held in 16 parts
**/
LABEL( loop )
/* { */
LVX( rq10, 0, r1ptr )
VMSUM( sum0, rq1m0, bt10, sum0 )
LVX( rq11, r1ptr, index1 )
VMSUM( sum1, rq1m1, bt11, sum1 )
LVX( rq12, r1ptr, index2 )
VMSUM( sum2, rq1m2, bt12, sum2 )
LVX( rq13, r1ptr, index3 )
DECR_C( icount )
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum3, rq1m3, bt13, sum3 )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI( r1ptr, r1ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )

LABEL( mid_loop )
LVX( rq00, 0, r0ptr )
VMSUM( sum0, rq10, bt1m0, sum0 )
LVX( rq01, r0ptr, index1 )
VMSUM( sum1, rq11, bt1m1, sum1 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum2, rq12, bt1m2, sum2 )
LVX( rq03, r0ptr, index3 )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum3, rq13, bt1m3, sum3 )
LVX_BT( bt01, bt0ptr, index1 )
ADDI( r0ptr, r0ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt02, bt0ptr, index2 )
LVX_BT( bt03, bt0ptr, index3 )
ADDI( bt0ptr, bt0ptr, LOOP_BLOCK_SIZE )

LVX( rq1m0, 0, r1mptr )
VMSUM( sum0, rq00, bt00, sum0 )
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum1, rq01, bt01, sum1 )
LVX( rq1m2, r1mptr, index2 )
VMSUM( sum2, rq02, bt02, sum2 )
LVX( rq1m3, r1mptr, index3 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum3, rq03, bt03, sum3 )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( r1mptr, r1mptr, LOOP_BLOCK_SIZE )
LVX_BT( bt12, bt1ptr, index2 )
LVX_BT( bt13, bt1ptr, index3 )
ADDI( bt1ptr, bt1ptr, LOOP_BLOCK_SIZE )
/* } */
BNE( loop )
/**
    Loop exit code
**/
VMSUM( sum0, rq1m0, bt10, sum0 )
VMSUM( sum1, rq1m1, bt11, sum1 )
VMSUM( sum2, rq1m2, bt12, sum2 )
VMSUM( sum3, rq1m3, bt13, sum3 )

/**
    Remainders
**/
LABEL( do_half_block )
ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ( do_quarter_block )
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI( r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )
ADDI( bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum0, rq10, bt1m0, sum0 )
VMSUM( sum1, rq11, bt1m1, sum1 )
LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI( r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )
ADDI( bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum0, rq00, bt00, sum0 )
VMSUM( sum1, rq01, bt01, sum1 )
LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )
ADDI( bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum0, rq1m0, bt10, sum0 )
VMSUM( sum1, rq1m1, bt11, sum1 )

LABEL( do_quarter_block )
ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ( combine )
LVX( rq10, 0, r1ptr )
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum0, rq10, bt1m0, sum0 )
LVX( rq00, 0, r0ptr )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum0, rq00, bt00, sum0 )
LVX( rq1m0, 0, r1mptr )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum0, rq1m0, bt10, sum0 )

/**
    Combine sums and return
**/
LABEL( combine )
VXOR( zero, zero, zero )
VADDSWS( sum0, sum0, sum1 )    /* s00 s01 s02 s03 */
VADDSWS( sum2, sum2, sum3 )    /* s20 s21 s22 s23 */
VADDSWS( sum0, sum0, sum2 )    /* s00 s01 s02 s03 */
VSUMSWS( sum0, sum0, zero )    /* xxx xxx xxx s00 */
VSPLTW( sum0, sum0, 3 )        /* s00 s00 s00 s00 */
STVEWX( sum0, 0, C )

/**
    Return
**/
LABEL( ret )
FREE_THRU_v27( VRSAVE_COND )
REST_r13
RETURN
FUNC_EPILOG
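
As a reading aid, the following scalar C sketch models what the vectorized routine above appears to compute, including its 64-byte loop trips and the 32-byte and 16-byte remainder steps. It is an illustration under stated assumptions, not a translation of the Mercury macros: the function names are hypothetical, the R rows are taken as signed bytes and the bt data as unsigned bytes (the operand order of the mixed-sign VMSUMMBM instruction), N is assumed to be a multiple of 16, and the saturating behaviour of the final VADDSWS/VSUMSWS steps is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-byte partial dot product: signed R bytes times unsigned bt bytes. */
static int32_t dot16(const int8_t *r, const uint8_t *bt)
{
    int32_t s = 0;
    for (int k = 0; k < 16; k++)
        s += (int32_t)r[k] * (int32_t)bt[k];
    return s;
}

/* One R row against one bt window of n bytes (n a multiple of 16). */
static int32_t dot_row(const int8_t *r, const uint8_t *bt, size_t n)
{
    int32_t s = 0;
    size_t blocks = n >> 6;                 /* LOOP_COUNT_SHIFT = 6 */
    for (size_t b = 0; b < blocks; b++, r += 64, bt += 64)
        for (int q = 0; q < 4; q++)         /* four 16-byte quarters */
            s += dot16(r + 16 * q, bt + 16 * q);
    if (n & 0x20) {                         /* HALF_BLOCK_BIT */
        s += dot16(r, bt) + dot16(r + 16, bt + 16);
        r += 32;
        bt += 32;
    }
    if (n & 0x10)                           /* QUARTER_BLOCK_BIT */
        s += dot16(r, bt);
    return s;
}

/* R1 . Bt(-1) + R0 . Bt(0) + R1m . Bt(+1), with bt windows hat_tc bytes apart. */
int32_t dotpr3_model(const uint8_t *bt1m, const int8_t *r1, const int8_t *r0,
                     const int8_t *r1m, size_t n, size_t hat_tc)
{
    const uint8_t *bt0 = bt1m + hat_tc;
    const uint8_t *bt1 = bt0 + hat_tc;
    return dot_row(r1, bt1m, n) + dot_row(r0, bt0, n) + dot_row(r1m, bt1, n);
}
```

A model of this kind is mainly useful as a reference when checking the assembly output on small test vectors.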
/*
    MC Standard Algorithms    PPC Macro language Version

    File Name:      dotpr6_8bit.mac
    Description:    Source code for routine which computes six
                    dot products, combining the six sums prior
                    into two outputs prior to exit.

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision    Date      Engineer    Reason
    0.0         000510    fpl         Created
    0.1         000521    fpl         Changed to fixed point
    0.2         000521    fpl         Added num_cached_rows
    0.3         000605    fpl         Changed to .k file
    0.4         000926    jg          Back to .mac and no dsts
*/



#include "salppc.inc"

#define LVX_BT( vT, rA, rB )      LVX( vT, rA, rB )
#define FUNC_ENTRY                dotpr6_8bit
#define VMSUM( vT, vA, vB, vC )   VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT          6
#define HALF_BLOCK_BIT            0x20
#define QUARTER_BLOCK_BIT         0x10
#define LOOP_BLOCK_SIZE           64

/**
    Input parameters
**/
#define bt1mptr   r3
#define r1ptr     r4
#define r0ptr     r5
#define r1mptr    r6
#define C         r7
#define N         r8
#define hat_tc    r9

/**
    Local loop registers
**/
#define bt0ptr    r10
#define bt1ptr    r11
#define bt2ptr    r12
#define index1    r13
#define index2    r14
#define index3    r0
#define icount    hat_tc

/**
    G4 registers
**/
#define rq10      v0
#define rq11      v1
#define rq12      v2
#define rq13      v3
#define zero      v3

#define rq00      v4
#define rq01      v5
#define rq02      v6
#define rq03      v7

#define rq1m0     v8
#define rq1m1     v9
#define rq1m2     v10
#define rq1m3     v11

#define bt1m0     v12
#define bt1m1     v13
#define bt1m2     v14
#define bt1m3     v15

#define bt10      v12
#define bt11      v13
#define bt12      v14
#define bt13      v15

#define bt00      v16
#define bt01      v17
#define bt02      v18
#define bt03      v19

#define bt20      v16
#define bt21      v17
#define bt22      v18
#define bt23      v19

#define sum00     v20
#define sum01     v21
#define sum02     v22
#define sum03     v23

#define sum10     v24
#define sum11     v25
#define sum12     v26
#define sum13     v27
/**
    Begin code text
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r14
USE_THRU_v27( VRSAVE_COND )

/**
    Load up local loop registers
**/
ADD( bt0ptr, bt1mptr, hat_tc )
VXOR( sum00, sum00, sum00 )
ADD( bt1ptr, bt0ptr, hat_tc )
LI( index1, 16 )
ADD( bt2ptr, bt1ptr, hat_tc )
VXOR( sum01, sum01, sum01 )
LI( index2, 32 )
VXOR( sum02, sum02, sum02 )
LI( index3, 48 )
VXOR( sum03, sum03, sum03 )
VXOR( sum10, sum10, sum10 )
VXOR( sum11, sum11, sum11 )
VXOR( sum12, sum12, sum12 )
VXOR( sum13, sum13, sum13 )
SRWI_C( icount, N, LOOP_COUNT_SHIFT )
BEQ( do_half_block )

/**
    Loop entry code
**/
LVX_BT( bt1m0, 0, bt1mptr )
DECR_C( icount )
LVX_BT( bt1m1, bt1mptr, index1 )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
BR( mid_loop )


/**
    Loop computes three dot products held in 16 parts
**/
LABEL( loop )
/* { */
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum10, rq1m0, bt20, sum10 )
LVX_BT( bt1m1, bt1mptr, index1 )
VMSUM( sum11, rq1m1, bt21, sum11 )
LVX_BT( bt1m2, bt1mptr, index2 )
DECR_C( icount )
VMSUM( sum12, rq1m2, bt22, sum12 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
VMSUM( sum13, rq1m3, bt23, sum13 )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
ADDI( bt2ptr, bt2ptr, LOOP_BLOCK_SIZE )
LVX( rq13, r1ptr, index3 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )

LABEL( mid_loop )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum00, rq10, bt1m0, sum00 )
LVX_BT( bt01, bt0ptr, index1 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt02, bt0ptr, index2 )
VMSUM( sum02, rq12, bt1m2, sum02 )
LVX_BT( bt03, bt0ptr, index3 )
ADDI( r1ptr, r1ptr, LOOP_BLOCK_SIZE )
LVX( rq00, 0, r0ptr )
VMSUM( sum03, rq13, bt1m3, sum03 )
LVX( rq01, r0ptr, index1 )
VMSUM( sum10, rq10, bt00, sum10 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum11, rq11, bt01, sum11 )
ADDI( bt0ptr, bt0ptr, LOOP_BLOCK_SIZE )
VMSUM( sum12, rq12, bt02, sum12 )
LVX( rq03, r0ptr, index3 )
VMSUM( sum13, rq13, bt03, sum13 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum00, rq00, bt00, sum00 )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( r0ptr, r0ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt12, bt1ptr, index2 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt13, bt1ptr, index3 )
VMSUM( sum02, rq02, bt02, sum02 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum03, rq03, bt03, sum03 )
ADDI( bt1ptr, bt1ptr, LOOP_BLOCK_SIZE )
VMSUM( sum10, rq00, bt10, sum10 )
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum11, rq01, bt11, sum11 )
LVX( rq1m2, r1mptr, index2 )
VMSUM( sum12, rq02, bt12, sum12 )
LVX( rq1m3, r1mptr, index3 )
ADDI( r1mptr, r1mptr, LOOP_BLOCK_SIZE )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum13, rq03, bt13, sum13 )
LVX_BT( bt21, bt2ptr, index1 )
VMSUM( sum00, rq1m0, bt10, sum00 )
LVX_BT( bt22, bt2ptr, index2 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt23, bt2ptr, index3 )
VMSUM( sum02, rq1m2, bt12, sum02 )
VMSUM( sum03, rq1m3, bt13, sum03 )
/* } */
BNE( loop )
/**
    Loop exit code
**/
VMSUM( sum10, rq1m0, bt20, sum10 )
VMSUM( sum11, rq1m1, bt21, sum11 )
ADDI( bt2ptr, bt2ptr, LOOP_BLOCK_SIZE )
VMSUM( sum12, rq1m2, bt22, sum12 )
VMSUM( sum13, rq1m3, bt23, sum13 )

/**
    Remainders
**/
LABEL( do_half_block )
ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ( do_quarter_block )
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI( bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI( r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq10, bt1m0, sum00 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI( bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq10, bt00, sum10 )
VMSUM( sum11, rq11, bt01, sum11 )
LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
ADDI( r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq00, bt00, sum00 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI( bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq00, bt10, sum10 )
VMSUM( sum11, rq01, bt11, sum11 )
LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
ADDI( r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq1m0, bt10, sum00 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt20, 0, bt2ptr )
LVX_BT( bt21, bt2ptr, index1 )
ADDI( bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq1m0, bt20, sum10 )
VMSUM( sum11, rq1m1, bt21, sum11 )

LABEL( do_quarter_block )
ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ( combine )
LVX_BT( bt1m0, 0, bt1mptr )
LVX( rq10, 0, r1ptr )
VMSUM( sum00, rq10, bt1m0, sum00 )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum10, rq10, bt00, sum10 )
LVX( rq00, 0, r0ptr )
VMSUM( sum00, rq00, bt00, sum00 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum10, rq00, bt10, sum10 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum00, rq1m0, bt10, sum00 )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum10, rq1m0, bt20, sum10 )

/**
    Combine sums and return
**/
LABEL( combine )
VXOR( zero, zero, zero )
VADDSWS( sum00, sum00, sum01 )    /* s00 s01 s02 s03 */
VADDSWS( sum10, sum10, sum11 )
VADDSWS( sum02, sum02, sum03 )    /* s20 s21 s22 s23 */
VADDSWS( sum12, sum12, sum13 )
VADDSWS( sum00, sum00, sum02 )    /* s00 s01 s02 s03 */
VADDSWS( sum10, sum10, sum12 )
VSUMSWS( sum00, sum00, zero )     /* xxx xxx xxx s00 */
VSUMSWS( sum10, sum10, zero )
VSPLTW( sum00, sum00, 3 )         /* s00 s00 s00 s00 */
STVEWX( sum00, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum10, sum10, 3 )
STVEWX( sum10, 0, C )

/**
    Return
**/
LABEL( ret )
FREE_THRU_v27( VRSAVE_COND )
REST_r13_r14
RETURN
FUNC_EPILOG
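
The six-product routine above writes two adjacent 32-bit results, each the sum of three dot products taken against bt windows that slide by hat_tc bytes; the nine-product routine that follows extends the same pattern to three results. The scalar C sketch below expresses that pattern for an arbitrary number of outputs. It is an assumption-laden illustration: the name `dotpr_multi_model` is hypothetical, R bytes are treated as signed and bt bytes as unsigned (following the VMSUMMBM operand order), and saturation in the final additions is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* c[j] = R1 . Bt(j-1) + R0 . Bt(j) + R1m . Bt(j+1), for j = 0 .. nout-1,
 * where consecutive bt windows start hat_tc bytes apart and bt1m points
 * at window Bt(-1).  nout = 2 corresponds to the six-product routine and
 * nout = 3 to the nine-product routine. */
void dotpr_multi_model(const uint8_t *bt1m, const int8_t *r1, const int8_t *r0,
                       const int8_t *r1m, int32_t *c, size_t n, size_t hat_tc,
                       int nout)
{
    for (int j = 0; j < nout; j++) {
        const uint8_t *bt = bt1m + (size_t)j * hat_tc;
        int32_t s = 0;
        for (size_t k = 0; k < n; k++) {
            s += (int32_t)r1[k]  * (int32_t)bt[k];
            s += (int32_t)r0[k]  * (int32_t)bt[k + hat_tc];
            s += (int32_t)r1m[k] * (int32_t)bt[k + 2 * hat_tc];
        }
        c[j] = s;
    }
}
```

Computing several outputs in one call lets each loaded R vector be reused against more than one bt window, which is presumably why the assembly versions interleave the accumulations rather than calling the three-product routine repeatedly.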
/*
    File Name:      dotpr9_8bit.mac
    Description:    Source code for routine which computes nine
                    dot products, combining the nine sums prior
                    into three outputs prior to exit.

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision    Date      Engineer    Reason
    0.0         000510    fpl         Created
    0.1         000512    fpl         Added num_cached_rows
    0.2         000521    fpl         Changed to fixed point
    0.3         000605    fpl         Changed to .k file
    0.4         000926    jg          Back to .mac and no dsts
*/

#include "salppc.inc"

#define LVX_BT( vT, rA, rB )      LVX( vT, rA, rB )
#define FUNC_ENTRY                dotpr9_8bit
#define VMSUM( vT, vA, vB, vC )   VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT          6
#define HALF_BLOCK_BIT            0x20
#define QUARTER_BLOCK_BIT         0x10
#define LOOP_BLOCK_SIZE           64

/**
    Input parameters
**/
#define bt1mptr   r3
#define r1ptr     r4
#define r0ptr     r5
#define r1mptr    r6
#define C         r7
#define N         r8
#define hat_tc    r9

/**
    Local loop registers
**/
#define bt0ptr    r10
#define bt1ptr    r11
#define bt2ptr    r12
#define bt3ptr    r13
#define index1    r14
#define index2    r15
#define index3    r0
#define icount    hat_tc

/**
    G4 registers
**/
#define rq10      v0
#define rq11      v1
#define rq12      v2
#define rq13      v3
#define zero      v3

#define bt30      v0
#define bt31      v1
#define bt32      v2
#define bt33      v3

#define rq00      v4
#define rq01      v5
#define rq02      v6
#define rq03      v7

#define rq1m0     v8
#define rq1m1     v9
#define rq1m2     v10
#define rq1m3     v11

#define bt1m0     v12
#define bt1m1     v13
#define bt1m2     v14
#define bt1m3     v15

#define bt10      v12
#define bt11      v13
#define bt12      v14
#define bt13      v15

#define bt00      v16
#define bt01      v17
#define bt02      v18
#define bt03      v19

#define bt20      v16
#define bt21      v17
#define bt22      v18
#define bt23      v19

#define sum00     v20
#define sum01     v21
#define sum02     v22
#define sum03     v23

#define sum10     v24
#define sum11     v25
#define sum12     v26
#define sum13     v27

#define sum20     v28
#define sum21     v29
#define sum22     v30
#define sum23     v31



/**
    Begin code text
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r15
USE_THRU_v31( VRSAVE_COND )

/**
    Load up local loop registers
**/
ADD( bt0ptr, bt1mptr, hat_tc )
VXOR( sum00, sum00, sum00 )
ADD( bt1ptr, bt0ptr, hat_tc )
LI( index1, 16 )
ADD( bt2ptr, bt1ptr, hat_tc )
VXOR( sum01, sum01, sum01 )
ADD( bt3ptr, bt2ptr, hat_tc )
LI( index2, 32 )
VXOR( sum02, sum02, sum02 )
LI( index3, 48 )
VXOR( sum03, sum03, sum03 )
VXOR( sum10, sum10, sum10 )
VXOR( sum11, sum11, sum11 )
VXOR( sum12, sum12, sum12 )
VXOR( sum13, sum13, sum13 )
VXOR( sum20, sum20, sum20 )
VXOR( sum21, sum21, sum21 )
VXOR( sum22, sum22, sum22 )
VXOR( sum23, sum23, sum23 )
SRWI_C( icount, N, LOOP_COUNT_SHIFT )
BEQ( do_half_block )

/**
    Loop entry code
**/
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
DECR_C( icount )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
LVX_BT( bt00, 0, bt0ptr )
BR( mid_loop )


/**
    Nine dot products producing 3 sums:
    sum0 = (R1 * Bt1m) + (R0 * Bt0) + (R1m * Bt1)
    sum1 = (R1 * Bt0) + (R0 * Bt1) + (R1m * Bt2)
    sum2 = (R1 * Bt1) + (R0 * Bt2) + (R1m * Bt3)
**/
LABEL( loop )
/* { */
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum20, rq1m0, bt30, sum20 )    /* R1m * Bt3 */
LVX_BT( bt1m1, bt1mptr, index1 )
VMSUM( sum21, rq1m1, bt31, sum21 )
LVX_BT( bt1m2, bt1mptr, index2 )
VMSUM( sum22, rq1m2, bt32, sum22 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
VMSUM( sum23, rq1m3, bt33, sum23 )
ADDI( bt1mptr, bt1mptr, LOOP_BLOCK_SIZE )
LVX( rq11, r1ptr, index1 )
VMSUM( sum20, rq00, bt20, sum20 )     /* R0 * Bt2 */
LVX( rq12, r1ptr, index2 )
VMSUM( sum21, rq01, bt21, sum21 )
DECR_C( icount )
VMSUM( sum22, rq02, bt22, sum22 )
LVX( rq13, r1ptr, index3 )
VMSUM( sum23, rq03, bt23, sum23 )
LVX_BT( bt00, 0, bt0ptr )

/**
    Loop entry
**/
LABEL( mid_loop )
VMSUM( sum00, rq10, bt1m0, sum00 )    /* R1 * Bt1m */
LVX_BT( bt01, bt0ptr, index1 )
ADDI( r1ptr, r1ptr, LOOP_BLOCK_SIZE )
LVX_BT( bt02, bt0ptr, index2 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt03, bt0ptr, index3 )
VMSUM( sum02, rq12, bt1m2, sum02 )
LVX( rq00, 0, r0ptr )
VMSUM( sum03, rq13, bt1m3, sum03 )
ADDI( bt0ptr, bt0ptr, LOOP_BLOCK_SIZE )
VMSUM( sum10, rq10, bt00, sum10 )     /* R1 * Bt0 */
LVX( rq01, r0ptr, index1 )
VMSUM( sum11, rq11, bt01, sum11 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum12, rq12, bt02, sum12 )
LVX( rq03, r0ptr, index3 )
ADDI( r0ptr, r0ptr, LOOP_BLOCK_SIZE )
VMSUM( sum13, rq13, bt03, sum13 )
LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
VMSUM( sum00, rq00, bt00, sum00 )     /* R0 * Bt0 */
LVX_BT( bt12, bt1ptr, index2 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt13, bt1ptr, index3 )
VMSUM( sum02, rq02, bt02, sum02 )
VMSUM( sum03, rq03, bt03, sum03 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum20, rq10, bt10, sum20 )     /* R1 * Bt1 */
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum21, rq11, bt11, sum21 )
LVX( rq1m2, r1mptr, index2 )
ADDI( bt1ptr, bt1ptr, LOOP_BLOCK_SIZE )
LVX( rq1m3, r1mptr, index3 )
VMSUM( sum22, rq12, bt12, sum22 )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum23, rq13, bt13, sum23 )
LVX_BT( bt21, bt2ptr, index1 )
VMSUM( sum10, rq00, bt10, sum10 )     /* R0 * Bt1 */
ADDI( r1mptr, r1mptr, LOOP_BLOCK_SIZE )
VMSUM( sum11, rq01, bt11, sum11 )
LVX_BT( bt22, bt2ptr, index2 )
VMSUM( sum12, rq02, bt12, sum12 )
LVX_BT( bt23, bt2ptr, index3 )
VMSUM( sum13, rq03, bt13, sum13 )
LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )
VMSUM( sum00, rq1m0, bt10, sum00 )    /* R1m * Bt1 */
LVX_BT( bt32, bt3ptr, index2 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt33, bt3ptr, index3 )
VMSUM( sum02, rq1m2, bt12, sum02 )
VMSUM( sum03, rq1m3, bt13, sum03 )
ADDI( bt2ptr, bt2ptr, LOOP_BLOCK_SIZE )
VMSUM( sum10, rq1m0, bt20, sum10 )    /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )
ADDI( bt3ptr, bt3ptr, LOOP_BLOCK_SIZE )
VMSUM( sum12, rq1m2, bt22, sum12 )
VMSUM( sum13, rq1m3, bt23, sum13 )
/* } */
BNE( loop )

/**
    Loop exit code
**/

VMSUM( sum20, rq1m0, bt30, sum20 )      /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )
VMSUM( sum22, rq1m2, bt32, sum22 )
VMSUM( sum23, rq1m3, bt33, sum23 )
VMSUM( sum20, rq00, bt20, sum20 )       /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )
VMSUM( sum22, rq02, bt22, sum22 )
VMSUM( sum23, rq03, bt23, sum23 )

/**
    Remainders
**/

LABEL( do_half_block )

ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ( do_quarter_block )

LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI(bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )

LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI(r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq10, bt1m0, sum00 )      /* R1 * Bt1m */
VMSUM( sum01, rq11, bt1m1, sum01 )

LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI(bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum10, rq10, bt00, sum10 )       /* R1 * Bt0 */
VMSUM( sum11, rq11, bt01, sum11 )

LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
ADDI(r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq00, bt00, sum00 )       /* R0 * Bt0 */
VMSUM( sum01, rq01, bt01, sum01 )

LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI(bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum20, rq10, bt10, sum20 )       /* R1 * Bt1 */
VMSUM( sum21, rq11, bt11, sum21 )

VMSUM( sum10, rq00, bt10, sum10 )       /* R0 * Bt1 */
VMSUM( sum11, rq01, bt11, sum11 )

LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
ADDI(r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum00, rq1m0, bt10, sum00 )      /* R1m * Bt1 */
VMSUM( sum01, rq1m1, bt11, sum01 )

LVX_BT( bt20, 0, bt2ptr )
LVX_BT( bt21, bt2ptr, index1 )
ADDI(bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum20, rq00, bt20, sum20 )       /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )

VMSUM( sum10, rq1m0, bt20, sum10 )      /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )

LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )
ADDI(bt3ptr, bt3ptr, (LOOP_BLOCK_SIZE >> 1) )

VMSUM( sum20, rq1m0, bt30, sum20 )      /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )

/**
    four more sums
**/

LABEL( do_quarter_block )

ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ( combine )

LVX_BT( bt1m0, 0, bt1mptr )
LVX( rq10, 0, r1ptr )

VMSUM( sum00, rq10, bt1m0, sum00 )      /* R1 * Bt1m */
ADDI(bt1mptr, bt1mptr, 16)

LVX_BT( bt00, 0, bt0ptr )

VMSUM( sum10, rq10, bt00, sum10 )       /* R1 * Bt0 */

LVX( rq00, 0, r0ptr )

VMSUM( sum00, rq00, bt00, sum00 )       /* R0 * Bt0 */

LVX_BT( bt10, 0, bt1ptr )

VMSUM( sum20, rq10, bt10, sum20 )       /* R1 * Bt1 */

VMSUM( sum10, rq00, bt10, sum10 )       /* R0 * Bt1 */

LVX( rq1m0, 0, r1mptr )

VMSUM( sum00, rq1m0, bt10, sum00 )      /* R1m * Bt1 */
LVX_BT( bt20, 0, bt2ptr )

VMSUM( sum20, rq00, bt20, sum20 )       /* R0 * Bt2 */
VMSUM( sum10, rq1m0, bt20, sum10 )      /* R1m * Bt2 */

LVX_BT( bt30, 0, bt3ptr )

VMSUM( sum20, rq1m0, bt30, sum20 )      /* R1m * Bt3 */

/**
    Combine sums and return
**/

LABEL( combine )

VXOR( zero, zero, zero )

( sum00, sum00, sum01 )
( sum10, sum10, sum11 )
( sum20, sum20, sum21 )
( sum02, sum02, sum03 )
( sum12, sum12, sum13 )
( sum22, sum22, sum23 )
( sum00, sum00, sum02 )
( sum10, sum10, sum12 )
( sum20, sum20, sum22 )
( sum00, sum00, zero )          /* XXX XXX XXX s00 */
( sum10, sum10, zero )
( sum20, sum20, zero )

VSPLTW( sum00, sum00, 3 )       /* s00 s00 s00 s00 */
STVEWX( sum00, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum10, sum10, 3 )
STVEWX( sum10, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum20, sum20, 3 )
STVEWX( sum20, 0, C )
/**
    Return
**/

LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )
REST_r13_r15
RETURN
FUNC_EPILOG
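
The nine dot products computed by the routine above can be written as a scalar reference in C. This is only a sketch of the arithmetic stated in the listing's comment (sum0, sum1, sum2); the function name `dotpr9_ref`, the `int8_t` element type, the 32-bit accumulators, and the separate pointer arguments are assumptions made for illustration, and any saturation performed by the vector instructions is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain dot product of two n-element vectors with a 32-bit accumulator. */
static int32_t dot(const int8_t *x, const int8_t *y, size_t n)
{
    int32_t acc = 0;
    for (size_t k = 0; k < n; k++)
        acc += (int32_t)x[k] * (int32_t)y[k];
    return acc;
}

/*
 * Hypothetical scalar reference for the three outputs:
 *   c[0] = R1*Bt1m + R0*Bt0 + R1m*Bt1
 *   c[1] = R1*Bt0  + R0*Bt1 + R1m*Bt2
 *   c[2] = R1*Bt1  + R0*Bt2 + R1m*Bt3
 */
void dotpr9_ref(const int8_t *r1, const int8_t *r0, const int8_t *r1m,
                const int8_t *bt1m, const int8_t *bt0, const int8_t *bt1,
                const int8_t *bt2, const int8_t *bt3,
                int32_t c[3], size_t n)
{
    c[0] = dot(r1, bt1m, n) + dot(r0, bt0, n) + dot(r1m, bt1, n);
    c[1] = dot(r1, bt0, n)  + dot(r0, bt1, n) + dot(r1m, bt2, n);
    c[2] = dot(r1, bt1, n)  + dot(r0, bt2, n) + dot(r1m, bt3, n);
}
```

The vector routine reaches the same three values by keeping four partial accumulators per output and folding them together in the combine step; the reference keeps one accumulator per dot product for readability.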
#ifndef MCOS_55
#define MCOS_55 0
#endif

/*
    MC Standard Algorithms -- 603e Macro language Version

    File Name:      CDOTPR.MAC
    Description:    Vector Single Precision Complex Dot Product
    Entry/params:   CDOTPR (A, I, B, J, C, N)
    Formula:        C[0] = sum (A[mI]*B[mJ] - A[mI+1]*B[mJ+1])
                    C[1] = sum (A[mI]*B[mJ+1] + A[mI+1]*B[mJ])
                    for m=0 to N-1

    Mercury Computer Systems, Inc.
    Copyright (c) 1995 All rights reserved

    Revision    Date    Engineer    Reason
    0.0         960502  fpl         Created
    0.1         960618  fpl         Added Esal entry
    0.2         970128  fpl         Added dcbt logic
    0.3         970203  fpl         Corrected ABIT define
    0.4         970522  jfk         Added new dcbx test macros
    0.5         980325  fpl         Added 740 code segment
    0.6         980404  fpl         Removed loop stall
    0.7         980708  fpl         Added build macros
    0.8         980820  jfk         Added new DCBT macro
    0.9         981019  fpl         Added z function
    0.10        981025  fpl         Modified z entry
    0.11        990310  fpl         750/G4 integration
    0.12        990730  fpl         Added conjugate entry
    1.0         000223  fpl         Increased minimum VMX count
    1.1         000305  jfk         removed branches to entrypoints
    1.2         000607  jfk         Fixed floating point save bug
    1.3         000610  fpl         Added new API macro
*/


#include "salppc.inc"
#undef BR_IF_VMX_Z2

#define BR_IF_VMX_Z2( root_name, uroot_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr2, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_unaligned_vmx; \
    BR_VMX_Z2( root_name, eflag, s1 ) \
z_unaligned_vmx: \
    BR_VMX_Z2( uroot_name, eflag, s1 ) \
z_skip_vmx:
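
The decision made by the macro above can be restated in C. This is a sketch of the branch logic only: the enum, the function name `choose_vmx_path`, and the parameter types are hypothetical, since the macro itself branches to labels rather than returning a value. Each compare and its dependent branch are interleaved with the next `xor` in the assembly, so the tests below are listed in the order their branches take effect.

```c
#include <stdint.h>

/* Hypothetical result codes standing in for the macro's three exits. */
enum vmx_path { VMX_SKIP, VMX_ALIGNED, VMX_UNALIGNED };

enum vmx_path choose_vmx_path(uintptr_t pr1, uintptr_t pi1, long s1,
                              uintptr_t pr2, uintptr_t pi2, long s2,
                              unsigned long n,
                              unsigned long min_n, long unit_stride)
{
    if (n < min_n)          return VMX_SKIP;      /* too few points          */
    if (s1 != unit_stride)  return VMX_SKIP;      /* first operand strided   */
    if (s2 != unit_stride)  return VMX_SKIP;      /* second operand strided  */
    if ((pr1 ^ pi1) & 0xf)  return VMX_SKIP;      /* real/imag of operand 1
                                                     differ in 16-byte phase */
    if ((pr2 ^ pi2) & 0xf)  return VMX_SKIP;      /* same test for operand 2 */
    if ((pr1 ^ pr2) & 0xf)  return VMX_UNALIGNED; /* operands out of phase:
                                                     take the uroot_name path */
    return VMX_ALIGNED;                           /* take the root_name path  */
}
```

XOR-ing two addresses and masking with `0xf` is nonzero exactly when the addresses sit at different offsets within a 16-byte vector, which is why no absolute alignment test appears.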

#define ACOND 5
#define ABIT 2
#define BCOND 6
#define BBIT 1

/**
    API registers
**/

#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
    z input args
**/

#define Ar A
#define Ai r10
#define Br B
#define Bi r11
#define Cr C
#define Ci r12

/**
    Local registers
**/

#define count r13
#define rtmp r13
#define nextline r14

/**
    Fpu registers
**/

#define rsumr0 f0
#define rsumi0 f1
#define isumr0 f2
#define isumi0 f3
#define ar0 f4
#define ai0 f5
#define ar1 f6
#define ai1 f7
#define ar2 f8
#define ai2 f9
#define ar3 f10
#define ai3 f11
#define br0 f12
#define bi0 f13
#define br1 f14
#define bi1 f15
#define br2 f16
#define bi2 f17
#define br3 f18
#define bi3 f19



#if defined( BUILD_MAX )
#if MCOS_55
DECLARE_VMX_Z2( _zdotpr_vmx_cc )
#else
DECLARE_VMX_Z2( _zdotpr_vmx )
#endif
DECLARE_VMX_Z2( _zdotpr4_vmx )
#endif

/**
    Code text: Conjugate
**/

FUNC_PROLOG
#ifndef COMPILE_C
U_ENTRY( fixed cidotpr_ )               /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed cidotpr )                /* C SAL */
LI( EFLAG, SAL_NNN )                    /* NNN EFLAG (default) */
BR( cidotprx_common )                   /* common path */
U_ENTRY( fixed cidotprx_ )              /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
U_ENTRY( fixed cidotprx )               /* C ESAL */
LABEL( cidotprx_common )                /* common path */
ADDI( Ai, Ar, 4 )
MR( Bi, Br )
ADDI( Br, Br, 4 )
MR( Ci, Cr )
ADDI( Cr, Cr, 4 )
BR( common )

/**
    Normal
**/

FUNC_PROLOG
#ifndef COMPILE_C
U_ENTRY( fixed cdotpr_ )                /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed cdotpr )                 /* C SAL */
LI( EFLAG, SAL_NNN )                    /* NNN EFLAG (default) */
BR( cdotprx_common )                    /* common path */
U_ENTRY( fixed cdotprx_ )               /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
U_ENTRY( fixed cdotprx )                /* C ESAL */
LABEL( cdotprx_common )                 /* common path */
ADDI( Ai, Ar, 4 )
ADDI( Bi, Br, 4 )
ADDI( Ci, Cr, 4 )
BR( common )

/**
    Split complex entries: Conjugate
**/

U_ENTRY( fixed zidotpr_ )               /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed zidotpr )                /* C SAL */
LI( EFLAG, SAL_NNN )                    /* NNN EFLAG (default) */
BR( zidotprx_common )                   /* common path */
U_ENTRY( fixed zidotprx_ )              /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
#endif

ENTRY_7( fixed zidotprx, A, I, B, J, C, N, EFLAG )
LABEL( zidotprx_common )
/**
    Assign split complex pointers, do the conjugate trick
**/

LWZ( Ai, A, 4 )
LWZ( Ar, A, 0 )
LWZ( Bi, B, 0 )
LWZ( Br, B, 4 )
LWZ( Ci, C, 0 )
LWZ( Cr, C, 4 )
BR( z_common )

/**
    Normal
**/

U_ENTRY( fixed zdotpr_ )                /* Fortran SAL */
FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed zdotpr )                 /* C SAL */
LI( EFLAG, SAL_NNN )                    /* NNN EFLAG (default) */
BR( zdotprx_common )                    /* common path */
U_ENTRY( fixed zdotprx_ )               /* Fortran ESAL */
FORTRAN_DREF_4( I, J, N, EFLAG )
#endif

/**
    C ESAL
**/

ENTRY_7( fixed zdotprx, A, I, B, J, C, N, EFLAG )
DECLARE_r10_r14
DECLARE_f0_f19

LABEL( zdotprx_common )
/** 

Assign split complex pointers 
**/ 

LWZ( Ai, A, 4 ) /* must load imag first since Ar reg = A reg */ 

LWZ( Ar, A, 0 ) 

LWZ( Bi, B, 4 ) 

LWZ( Br, B, 0 ) 

LWZ( Ci, C, 4 ) 

LWZ( Cr, C, 0 ) 

/**
    VMX API filter
    Test if okay to enter VMX code and branch to VMX code
    VMX loop - process all N points
**/

LABEL( z_common )

#if defined( BUILD_MAX )

#define MIN_VMX_N 20
#define UNIT_STRIDE 1

#if MCOS_55

BR_IF_VMX_Z2( zdotpr_vmx_cc, zdotpr4_vmx, MIN_VMX_N, UNIT_STRIDE, \
              Ar, Ai, I, Br, Bi, J, N, EFLAG )

#else

BR_IF_VMX_Z2( zdotpr_vmx, zdotpr4_vmx, MIN_VMX_N, UNIT_STRIDE, \
              Ar, Ai, I, Br, Bi, J, N, EFLAG )

#endif

#endif /* BUILD_MAX */
/** 

Point of common path where all entries join 
Test for small counts 
**/ 

LABEL( common )
SAVE_r13_r14
SAVE_f14_f19
CMPLWI(N, 0)
BEQ(ret)
CMPLWI(N, 1)
BEQ(do1)
CMPLWI(N, 2)
BEQ(do2)
CMPLWI(N, 3)
BEQ(do3)

/**
    check for uncached (and local) vectors
**/

SET_2_DCBT_COND( ACOND, ABIT, BCOND, BBIT, EFLAG, rtmp )
LI(nextline, 32)

/**
    740 code segment, start up loop code
**/

#if defined( BUILD_750 ) || defined( BUILD_MAX )
LFS( ar0, Ar, 0 )
SRWI( count, N, 2 )             /* count = N >> 2 */
LFS( br0, Br, 0 )
SLWI( I, I, 2 )                 /* byte strides */
LFS( ai0, Ai, 0 )
SLWI( J, J, 2 )
LFS( bi0, Bi, 0 )
LFSUX( ar1, Ar, I )
LFSUX( br1, Br, J )
LFSUX( ai1, Ai, I )
LFSUX( bi1, Bi, J )
LFSUX( ar2, Ar, I )
LFSUX( br2, Br, J )
LFSUX( ai2, Ai, I )
LFSUX( bi2, Bi, J )
FMULS( rsumr0, ar0, br0 )
LFSUX( ar3, Ar, I )
LFSUX( br3, Br, J )
FMULS( rsumi0, ai0, bi0 )
LFSUX( ai3, Ai, I )
LFSUX( bi3, Bi, J )
FMULS( isumi0, ar0, bi0 )
DECR_C( count )
FMULS( isumr0, ai0, br0 )
BEQ( flush_loop_740 )
BR( mloop_740 )

/**
    Top of 740 loop
**/

LABEL( loop_740 )

LFSUX( ar3, Ar, I )
FMADDS( rsumr0, ar0, br0, rsumr0 )
LFSUX( br3, Br, J )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
LFSUX( ai3, Ai, I )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )
LFSUX( bi3, Bi, J )

LABEL( mloop_740 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
LFSUX( ar0, Ar, I )
DCBT_IF( ACOND, Ar, nextline )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
LFSUX( br0, Br, J )
DECR_C( count )
FMADDS( isumi0, ar1, bi1, isumi0 )
LFSUX( ai0, Ai, I )
FMADDS( isumr0, ai1, br1, isumr0 )
LFSUX( bi0, Bi, J )
DCBT_IF( BCOND, Br, nextline )
FMADDS( rsumr0, ar2, br2, rsumr0 )
LFSUX( ar1, Ar, I )
LFSUX( br1, Br, J )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
LFSUX( ai1, Ai, I )
FMADDS( isumi0, ar2, bi2, isumi0 )
LFSUX( bi1, Bi, J )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
LFSUX( ar2, Ar, I )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
LFSUX( br2, Br, J )
FMADDS( isumi0, ar3, bi3, isumi0 )
LFSUX( ai2, Ai, I )
LFSUX( bi2, Bi, J )
FMADDS( isumr0, ai3, br3, isumr0 )
BNE( loop_740 )

/**
    Finish last pass
**/

FMADDS( rsumr0, ar0, br0, rsumr0 )
LFSUX( ar3, Ar, I )
LFSUX( br3, Br, J )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
LFSUX( ai3, Ai, I )
LFSUX( bi3, Bi, J )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )

LABEL( flush_loop_740 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
BR( remain )

#endif /** 750 specific code section **/
/**
    set up for loop entry, here if N >= 2
**/

#if defined( BUILD_603 )
LABEL( start_603 )

LFS( ar0, Ar, 0 )
SLWI( I, I, 2 )                 /* byte strides */
LFS( ai0, Ai, 0 )
SRWI( count, N, 2 )             /* count = N >> 2 */
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )
LFSUX( ar3, Ar, I )
LFSUX( ai3, Ai, I )

DCBT_IF( ACOND, Ar, nextline )

LFS( br0, Br, 0 )
DECR_C( count )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )
LFSUX( br3, Br, J )
LFSUX( bi3, Bi, J )

DCBT_IF( BCOND, Br, nextline )

FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )

FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )

FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
BEQ( remain )



/**
    main loop maintains four partial sums
    representing two complex sum updates per pass
**/

LABEL( loop )

LFSUX( ar0, Ar, I )
LFSUX( ai0, Ai, I )
LFSUX( ar1, Ar, I )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )
LFSUX( ar3, Ar, I )
LFSUX( ai3, Ai, I )

DCBT_IF( ACOND, Ar, nextline )

DECR_C( count )

LFSUX( br0, Br, J )
LFSUX( bi0, Bi, J )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )
LFSUX( br3, Br, J )
LFSUX( bi3, Bi, J )

DCBT_IF( BCOND, Br, nextline )

FMADDS( rsumr0, ar0, br0, rsumr0 )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )

FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )

FMADDS( rsumr0, ar3, br3, rsumr0 )
FMADDS( rsumi0, ai3, bi3, rsumi0 )
FMADDS( isumi0, ar3, bi3, isumi0 )
FMADDS( isumr0, ai3, br3, isumr0 )
BNE( loop )
#endif /** 603 specific code section **/
/**
    remainder loop
**/

LABEL( remain )

ANDI_C( count, N, 2 )           /* bit 2 */
BEQ( sum1 )

LFSUX( ar0, Ar, I )
LFSUX( ai0, Ai, I )
LFSUX( ar1, Ar, I )
LFSUX( ai1, Ai, I )

LFSUX( br0, Br, J )
LFSUX( bi0, Bi, J )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )

FMADDS( rsumr0, ar0, br0, rsumr0 )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )
FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )

LABEL( sum1 )

ANDI_C( count, N, 1 )           /* bit 0 */
BEQ( combine )                  /* if no sums left */

LFSUX( ar0, Ar, I )
LFSUX( br0, Br, J )
LFSUX( ai0, Ai, I )
LFSUX( bi0, Bi, J )

FMADDS( rsumr0, ar0, br0, rsumr0 )
FMADDS( rsumi0, ai0, bi0, rsumi0 )
FMADDS( isumi0, ar0, bi0, isumi0 )
FMADDS( isumr0, ai0, br0, isumr0 )

/**
    combine partial sums, write out results and return
**/

LABEL( combine )

FSUBS( rsumr0, rsumr0, rsumi0 )     /** rsumr0 = rsumr0 - rsumi0 **/
STFS( rsumr0, Cr, 0 )               /** *(C + 0) = rsumr0 **/
FADDS( isumi0, isumi0, isumr0 )
STFS( isumi0, Ci, 0 )
BR( ret )

/**
    here for N = 1,2,3
**/

LABEL( do3 )

LFS( ar0, Ar, 0 )
SLWI( I, I, 2 )                 /* byte strides */
LFS( ai0, Ai, 0 )
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )
LFSUX( ar2, Ar, I )
LFSUX( ai2, Ai, I )

LFS( br0, Br, 0 )
DECR_C( count )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )
LFSUX( br2, Br, J )
LFSUX( bi2, Bi, J )

FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )

FMADDS( rsumr0, ar2, br2, rsumr0 )
FMADDS( rsumi0, ai2, bi2, rsumi0 )
FMADDS( isumi0, ar2, bi2, isumi0 )
FMADDS( isumr0, ai2, br2, isumr0 )
BR( combine )

LABEL( do2 )

LFS( ar0, Ar, 0 )
SLWI( I, I, 2 )                 /* byte strides */
LFS( ai0, Ai, 0 )
LFSUX( ar1, Ar, I )
SLWI( J, J, 2 )
LFSUX( ai1, Ai, I )

LFS( br0, Br, 0 )
LFS( bi0, Bi, 0 )
LFSUX( br1, Br, J )
LFSUX( bi1, Bi, J )

FMULS( rsumr0, ar0, br0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumi0, ar0, bi0 )
FMULS( isumr0, ai0, br0 )

FMADDS( rsumr0, ar1, br1, rsumr0 )
FMADDS( rsumi0, ai1, bi1, rsumi0 )
FMADDS( isumi0, ar1, bi1, isumi0 )
FMADDS( isumr0, ai1, br1, isumr0 )
BR( combine )

LABEL( do1 )

LFS( ai0, Ai, 0 )
LFS( bi0, Bi, 0 )
LFS( br0, Br, 0 )
LFS( ar0, Ar, 0 )
FMULS( rsumi0, ai0, bi0 )
FMULS( isumr0, ai0, br0 )
FMSUBS( rsumr0, ar0, br0, rsumi0 )
STFS( rsumr0, Cr, 0 )
FMADDS( isumi0, ar0, bi0, isumr0 )
STFS( isumi0, Ci, 0 )

/**
    return
**/

LABEL( ret )

REST_f14_f19
REST_r13_r14
RETURN
FUNC_EPILOG
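
As a cross-check on the scalar path above, the header formula for CDOTPR can be written directly in C. This is a minimal sketch, not the library's implementation: the name `cdotpr_ref` and the argument types are assumptions, the strides are taken in units of `float` elements as the formula's `A[mI]` indexing suggests, and the four local accumulators mirror the roles of `rsumr0`, `rsumi0`, `isumi0`, and `isumr0` in the listing.

```c
#include <stddef.h>

/*
 * Hypothetical scalar reference for interleaved complex data:
 *   c[0] = sum( A[mI]*B[mJ]   - A[mI+1]*B[mJ+1] )
 *   c[1] = sum( A[mI]*B[mJ+1] + A[mI+1]*B[mJ]   )   for m = 0 .. n-1
 */
void cdotpr_ref(const float *a, ptrdiff_t i_stride,
                const float *b, ptrdiff_t j_stride,
                float c[2], size_t n)
{
    float rr = 0.0f;   /* real * real, like rsumr0 */
    float ii = 0.0f;   /* imag * imag, like rsumi0 */
    float ri = 0.0f;   /* real * imag, like isumi0 */
    float ir = 0.0f;   /* imag * real, like isumr0 */

    for (size_t m = 0; m < n; m++) {
        const float ar = a[0], ai = a[1];
        const float br = b[0], bi = b[1];
        rr += ar * br;
        ii += ai * bi;
        ri += ar * bi;
        ir += ai * br;
        a += i_stride;
        b += j_stride;
    }
    c[0] = rr - ii;    /* same final step as the FSUBS in the combine block */
    c[1] = ri + ir;    /* same final step as the FADDS in the combine block */
}
```

Keeping the four products in separate accumulators until the end, rather than subtracting inside the loop, is what lets the assembly issue independent multiply-adds back to back.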
/*
    MC Standard Algorithms -- PPC Macro language Version

    File Name:      GEN_R_SUMS.MAC

    Description:    Multiple small dot product routine for wireless
                    group application.

    Entry/params:   GEN_R_SUMS (X_bf, Corr_bf, Ptov_map, R_sums, Num_phys_users)

    Formula:
        num_sums = 0;
        for ( i = 0; i < Num_phys_users; i++ ) {
            for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
                sum = 0;
                for ( k = 0; k < 16; k++ ) {
                    sum += (BF32)X_bf[k].real * (BF32)Corr_bf->real;
                    sum += (BF32)X_bf[k].imag * (BF32)Corr_bf->imag;
                    ++Corr_bf;
                }
                *R_sums++ = sum;
                ++num_sums;
            }
            X_bf += N_FINGERS_MAX_SQUARED;
        }

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision    Date    Engineer    Reason
    0.0         000906  fpl         Created
*/

#include "salppc.inc"

#define DO_IO 1
#define DO_PREFETCH 1

#if DO_IO
#define PTOV_BUMP_1 1
#define CORR_BUMP_32 32
#define CORR_BUMP_64 64
#define X_BUMP_64 64
#define RSUM_BUMP_8 8
#define RSUM_BUMP_4 4
#else
#define PTOV_BUMP_1 0
#define CORR_BUMP_32 0
#define CORR_BUMP_64 0
#define X_BUMP_64 0
#define RSUM_BUMP_8 0
#define RSUM_BUMP_4 0
#endif

#define LOAD_CORR( vT, rA, rB ) LVX( vT, rA, rB )
#define DST_BUMP CORR_BUMP_64

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM ) \
    DST( rA, rB, STRM ) \
    ADDI( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM )
#endif
#define OLOOP_BIT 6

/**
    Input parameters
**/

#define X_bf r3
#define Corr_bf r4
#define Ptov_map r5
#define R_sump r6
#define Num_phys_users r7

/**
    Local GPRs
**/

#define icount r8
#define ptov_count r9
#define indx1 r10
#define indx2 r11
#define indx3 r12
#define sindex1 r13
#define dstp r14
#define dst_code r15

/**
    G4 registers
**/

#define corr00 v0
#define corr01 v1
#define corr10 v2
#define corr11 v3

#define C0_0 v4
#define C1_0 v5
#define C0_8 v6
#define C1_8 v7

#define C0_16 v8
#define C1_16 v9
#define C0_24 v10
#define C1_24 v11

#define X0 v12
#define X8 v13
#define X16 v14
#define X24 v15

#define sum0 v16
#define sum1 v17
#define zero v18

/**
    Begin code text
**/

FUNC_PROLOG

ENTRY_5( gen_R_sums, X_bf, Corr_bf, Ptov_map, R_sump, Num_phys_users )

CMPWI( Num_phys_users, 0 )
BGT( start )
RETURN

LABEL( start )

SAVE_r13_r15
USE_THRU_v18( VRSAVE_COND )

MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 1, 0 )

/**
    DST setup
**/

ADDI( dstp, Corr_bf, 80 )       /* start prefetch advanced */
PREFETCH( dstp, dst_code, 0 )

/**
    Setup for outer loop entry
    Read and expand two corr vectors
    Set outer loop counter condition
**/

LI( indx1, 16 )
LI( indx2, 32 )
LI( indx3, 48 )
LI( sindex1, 4 )

CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )

LVX( corr00, 0, Corr_bf )
VXOR( zero, zero, zero )
LVX( corr01, Corr_bf, indx1 )
LVX( corr10, Corr_bf, indx2 )
LVX( corr11, Corr_bf, indx3 )

VUPKHSB( C0_0, corr00 )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
VUPKLSB( C0_8, corr00 )
ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump, R_sump, -RSUM_BUMP_8 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C0_16, corr01 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )
/**
    Outer loop for each physical user
**/

LABEL( oloop )
/* { */

DECR( Num_phys_users )
LBZU( ptov_count, Ptov_map, 1 )
BEQ_CR( OLOOP_BIT, ret )

LVX( X0, 0, X_bf )
LVX( X8, X_bf, indx1 )
SRWI_C( icount, ptov_count, 1 )
LVX( X16, X_bf, indx2 )
LVX( X24, X_bf, indx3 )
ADDI( X_bf, X_bf, X_BUMP_64 )

CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
BEQ_MINUS( one_sum )

/**
    Top of sum loop
    Produces two sums each pass
**/

LABEL( iloop )
/* { */

PREFETCH( dstp, dst_code, 0 )
VMSUMSHS( sum0, C0_0, X0, zero )
VMSUMSHS( sum1, C1_0, X0, zero )
LVX( corr00, 0, Corr_bf )
LVX( corr01, Corr_bf, indx1 )
LVX( corr10, Corr_bf, indx2 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
DECR_C( icount )
VMSUMSHS( sum1, C1_8, X8, sum1 )
LVX( corr11, Corr_bf, indx3 )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum1, C1_16, X16, sum1 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump, R_sump, RSUM_BUMP_8 )
VUPKLSB( C1_8, corr10 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VUPKHSB( C0_16, corr01 )
VMSUMSHS( sum1, C1_24, X24, sum1 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VSUMSWS( sum0, sum0, zero )
VUPKLSB( C1_24, corr11 )
VSUMSWS( sum1, sum1, zero )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump )
VSPLTW( sum1, sum1, 3 )
STVEWX( sum1, R_sump, sindex1 )

/* } */

BNE( iloop )

/**
    Drop out, check for remainders
**/

ANDI_C( icount, ptov_count, 0x1 )
BEQ( oloop )

/**
    One more sum:
    Enters and exits with two corr vectors loaded and expanded to 16 bit
**/

LABEL( one_sum )

VMSUMSHS( sum0, C0_0, X0, zero )
VMSUMSHS( sum0, C0_8, X8, sum0 )
ADDI( R_sump, R_sump, RSUM_BUMP_8 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VSUMSWS( sum0, sum0, zero )

/**
    corr00 consumed in one sum section
**/

VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump )

ADDI( R_sump, R_sump, -RSUM_BUMP_4 )    /* pre-dec pointer for loop reentry */

/**
    Setup for loop re-entry:

    loop exit ptr v
    corr00 corr10 corr00 corr10
    corr00 corr10 corr00 corr10
    loop re-entry ptr ^
**/

VMR( corr00, corr10 )
LVX( corr10, 0, Corr_bf )
VMR( corr01, corr11 )
LVX( corr11, Corr_bf, indx1 )
ADDI( Corr_bf, Corr_bf, CORR_BUMP_32 )

VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
VUPKHSB( C1_0, corr10 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C0_16, corr01 )
VUPKLSB( C0_24, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )

/* } */

BR( oloop )

/**
    Exit routine
**/

LABEL( ret )

FREE_THRU_v18( VRSAVE_COND )
REST_r13_r15
RETURN
FUNC_EPILOG
/*
    MC Standard Algorithms -- PPC Macro language Version

    File Name:      GEN_R_SUMS2.MAC

    Description:    Multiple small dot product routine for wireless
                    group application.

    Entry/params:   GEN_R_SUMS2 (X_bf, Corr0_bf, Corr1_bf,
                                 Ptov_map, R_sums0, R_sums1, Num_phys_users)

    Formula:
        num_sums = 0;
        for ( i = 0; i < Num_phys_users; i++ ) {
            for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
                sum = 0;
                for ( k = 0; k < 16; k++ ) {
                    sum0 += (BF32)X_bf[k].real * (BF32)Corr0_bf->real;
                    sum0 += (BF32)X_bf[k].imag * (BF32)Corr0_bf->imag;
                    sum1 += (BF32)X_bf[k].real * (BF32)Corr1_bf->real;
                    sum1 += (BF32)X_bf[k].imag * (BF32)Corr1_bf->imag;
                    ++Corr0_bf;
                    ++Corr1_bf;
                }
                *R_sums0++ = sum0;
                *R_sums1++ = sum1;
                ++num_sums;
            }
            X_bf += N_FINGERS_MAX_SQUARED;
        }

    Mercury Computer Systems, Inc.
    Copyright (c) 2000 All rights reserved

    Revision    Date    Engineer    Reason
    0.0         000906  fpl         Created
    0.1         000908  fpl         Fixed zero bug
*/

#include "salppc.inc"

#define DO_IO 1
#define DO_PREFETCH 3

#if DO_IO
#define PTOV_BUMP_1 1
#define CORR_BUMP_32 32
#define CORR_BUMP_64 64
#define X_BUMP_64 64
#define RSUM_BUMP_8 8
#define RSUM_BUMP_4 4
#else
#define PTOV_BUMP_1 0
#define CORR_BUMP_32 0
#define CORR_BUMP_64 0
#define X_BUMP_64 0
#define RSUM_BUMP_8 0
#define RSUM_BUMP_4 0
#endif

#define LOAD_CORR( vT, rA, rB ) LVX( vT, rA, rB )
#define DST_BUMP CORR_BUMP_64

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM ) \
    DST( rA, rB, STRM ) \
    ADDI( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM )
#endif

#define OLOOP_BIT 6
/**
    Input parameters
**/

#define X_bf r3
#define Corr0_bf r4
#define Corr1_bf r5
#define Ptov_map r6
#define R_sump0 r7
#define R_sump1 r8
#define Num_phys_users r9

/**
    Local GPRs
**/

#define icount r10
#define ptov_count r11
#define indx1 r12
#define indx2 r13
#define indx3 r14
#define sindex1 r15
#define dstp r16
#define dst_code r17
#define dst_stride indx3

/**
    G4 registers
**/

#define corr00 v0
#define corr01 v1
#define corr10 v2
#define corr11 v3
#define corr20 v4
#define corr21 v5
#define corr30 v6
#define corr31 corr00
#define zero v7

#define C0_0 v8
#define C1_0 v9
#define C2_0 v10
#define C3_0 v11

#define C0_8 v12
#define C1_8 v13
#define C2_8 v14
#define C3_8 v15

#define C0_16 v16
#define C1_16 v17
#define C2_16 v18
#define C3_16 v19

#define C0_24 v20
#define C1_24 v21
#define C2_24 v22
#define C3_24 v23

#define X0 v24
#define X8 v25
#define X16 v26
#define X24 v27

#define sum0 v28
#define sum1 v29
#define sum2 v30
#define sum3 v31
/**
    Begin code text
**/

FUNC_PROLOG
#if 1
NOP     /****** alignment may be important ******/
#endif

ENTRY_7( gen_R_sums2, X_bf, Corr0_bf, Corr1_bf, Ptov_map, R_sump0, R_sump1,
         Num_phys_users )

CMPWI( Num_phys_users, 0 )
BGT( start )
RETURN

LABEL( start )
SAVE_r13_r17
USE_THRU_v31( VRSAVE_COND )

/**
    DST setup
**/

SUB( dst_stride, Corr1_bf, Corr0_bf )
MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 2, dst_stride )
ADDI( dstp, Corr0_bf, 80 )      /* start prefetch advanced */
/* 48: 1087, 64: 1094, 80: 1043, 96: 1058, 112: 1049, 128: 1061 */
PREFETCH( dstp, dst_code, 0 )

/**
    Setup for outer loop entry
    Read and expand two corr vectors
    Set outer loop counter condition
**/

LI( indx1, 16 )
LI( indx2, 32 )
LI( indx3, 48 )
LI( sindex1, 4 )

CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )

LOAD_CORR( corr00, 0, Corr0_bf )
LOAD_CORR( corr10, Corr0_bf, indx2 )
ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
LOAD_CORR( corr20, 0, Corr1_bf )
ADDI( R_sump0, R_sump0, -RSUM_BUMP_8 )
LOAD_CORR( corr30, Corr1_bf, indx2 )
LOAD_CORR( corr01, Corr0_bf, indx1 )
ADDI( R_sump1, R_sump1, -RSUM_BUMP_8 )
LOAD_CORR( corr11, Corr0_bf, indx3 )
VXOR( zero, zero, zero )
LOAD_CORR( corr21, Corr1_bf, indx1 )

VUPKHSB( C0_0, corr00 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
VUPKHSB( C1_0, corr10 )
VUPKHSB( C2_0, corr20 )
VUPKHSB( C3_0, corr30 )

VUPKLSB( C0_8, corr00 )
LOAD_CORR( corr31, Corr1_bf, indx3 )    /* corr00, corr31 same register */
VUPKLSB( C1_8, corr10 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )
VUPKLSB( C2_8, corr20 )
VUPKLSB( C3_8, corr30 )

VUPKHSB( C0_16, corr01 )
VUPKHSB( C1_16, corr11 )
VUPKHSB( C2_16, corr21 )
VUPKHSB( C3_16, corr31 )

VUPKLSB( C0_24, corr01 )
VUPKLSB( C1_24, corr11 )
VUPKLSB( C2_24, corr21 )
VUPKLSB( C3_24, corr31 )
/** 

Outer loop for each physical user 
**/ 

LABEL ( oloop ) 

/* { */ 

DECR( Num phys users ) 

LBZU( ptov count, Ptov_map, PTOV_BUMP_l ) 

BEQ CR( OLOOP BIT, ret ) 

LVX( X0, 0, X bf ) 

LVX( X8, X bf, indxl ) 

SRWI_C( icount, ptov count, 1 ) 

LVX( X16, X bf , indx2" ) 

LVX( X24, X_bf, indx3 ) 

ADDI ( X bf , X bf , X BUMP 64 ) 

CMPWI CR( OLOOP BIT, Num_phys_users, 0 ) 

BEQ_MINUS( one_sum ) 

/*+ 

Top of sura loop 
Produces four sums each pass 
**/ 

LABEL( iloop )
/* { */

PREFETCH( dstp, dst_code, 0 )
LOAD_CORR( corr00, 0, Corr0_bf )
DECR_C( icount )
LOAD_CORR( corr10, Corr0_bf, indx2 )
VMSUMSHS( sum0, C0_0, X0, zero )
LOAD_CORR( corr20, 0, Corr1_bf )
VMSUMSHS( sum1, C1_0, X0, zero )
LOAD_CORR( corr30, Corr1_bf, indx2 )
LOAD_CORR( corr01, Corr0_bf, indx1 )
VMSUMSHS( sum2, C2_0, X0, zero )
LOAD_CORR( corr11, Corr0_bf, indx3 )
LOAD_CORR( corr21, Corr1_bf, indx1 )
VMSUMSHS( sum3, C3_0, X0, zero )
VUPKHSB( C0_0, corr00 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
VUPKHSB( C1_0, corr10 )
ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
VUPKHSB( C2_0, corr20 )
VMSUMSHS( sum1, C1_8, X8, sum1 )
VUPKHSB( C3_0, corr30 )
VMSUMSHS( sum2, C2_8, X8, sum2 )
VUPKLSB( C0_8, corr00 )
VMSUMSHS( sum3, C3_8, X8, sum3 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
LOAD_CORR( corr31, Corr1_bf, indx3 ) /* corr00, corr31 same register */
VUPKLSB( C1_8, corr10 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VUPKLSB( C2_8, corr20 )
VMSUMSHS( sum1, C1_16, X16, sum1 )
VUPKLSB( C3_8, corr30 )
ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
VUPKHSB( C0_16, corr01 )
VMSUMSHS( sum2, C2_16, X16, sum2 )
VUPKHSB( C1_16, corr11 )
VMSUMSHS( sum3, C3_16, X16, sum3 )
VUPKHSB( C2_16, corr21 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )
VMSUMSHS( sum1, C1_24, X24, sum1 )
VUPKHSB( C3_16, corr31 )
VMSUMSHS( sum2, C2_24, X24, sum2 )
VUPKLSB( C0_24, corr01 )
VMSUMSHS( sum3, C3_24, X24, sum3 )
VSUMSWS( sum0, sum0, zero )
VUPKLSB( C1_24, corr11 )
VSUMSWS( sum1, sum1, zero )
VUPKLSB( C2_24, corr21 )
VSUMSWS( sum2, sum2, zero )
VUPKLSB( C3_24, corr31 )
VSPLTW( sum0, sum0, 3 )
VSUMSWS( sum3, sum3, zero )
VSPLTW( sum1, sum1, 3 )
STVEWX( sum0, 0, R_sump0 )
VSPLTW( sum2, sum2, 3 )
STVEWX( sum1, R_sump0, sindex1 )
VSPLTW( sum3, sum3, 3 )
STVEWX( sum2, 0, R_sump1 )
STVEWX( sum3, R_sump1, sindex1 )

/* } */

BNE( iloop )

/**

Drop out, check for remainders
**/

ANDI_C( icount, ptov_count, 0x1 )
BEQ( oloop )

/**

One more sum:

Enters and exits with two corr vectors loaded and expanded to 16 bit
**/

LABEL( one_sum )

VMSUMSHS( sum0, C0_0, X0, zero )
ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
VMSUMSHS( sum2, C2_0, X0, zero )
ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
VMSUMSHS( sum0, C0_8, X8, sum0 )
VMSUMSHS( sum2, C2_8, X8, sum2 )
VMSUMSHS( sum0, C0_16, X16, sum0 )
VMSUMSHS( sum2, C2_16, X16, sum2 )
VMSUMSHS( sum0, C0_24, X24, sum0 )
VMSUMSHS( sum2, C2_24, X24, sum2 )
VSUMSWS( sum0, sum0, zero )
VSUMSWS( sum2, sum2, zero )

VSPLTW( sum0, sum0, 3 )
STVEWX( sum0, 0, R_sump0 )
VSPLTW( sum2, sum2, 3 )
STVEWX( sum2, 0, R_sump1 )

ADDI( R_sump0, R_sump0, -RSUM_BUMP_4 ) /* pre-dec pointers for loop reentry */
ADDI( R_sump1, R_sump1, -RSUM_BUMP_4 )

/*

Setup for loop re-entry: corr00 consumed in one_sum section

exit ptr v
corr00 corr10 corr00 corr10
corr00 corr10
corr10 corr00
corr00 corr10
re-entry ptr

*/

VMR( corr21, corr31 ) /* corr00, corr31 same register */
VMR( corr00, corr10 )
LOAD_CORR( corr10, 0, Corr0_bf )
VMR( corr01, corr11 )
LOAD_CORR( corr11, Corr0_bf, indx1 )
VMR( corr20, corr30 )
LOAD_CORR( corr30, 0, Corr1_bf )
VUPKHSB( C0_0, corr00 )
VUPKLSB( C0_8, corr00 )
LOAD_CORR( corr31, Corr1_bf, indx1 ) /* corr00, corr31 same register */
VUPKHSB( C1_0, corr10 )
VUPKLSB( C1_8, corr10 )
VUPKHSB( C2_0, corr20 )
VUPKLSB( C2_8, corr20 )
VUPKHSB( C3_0, corr30 )
VUPKLSB( C3_8, corr30 )
VUPKHSB( C0_16, corr01 )
ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_32 )
VUPKLSB( C0_24, corr01 )
ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_32 )
VUPKHSB( C1_16, corr11 )
VUPKLSB( C1_24, corr11 )
VUPKHSB( C2_16, corr21 )
VUPKLSB( C2_24, corr21 )
VUPKHSB( C3_16, corr31 )
VUPKLSB( C3_24, corr31 )

/* } */

BR( oloop )
/**

Exit routine
**/

LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )

REST_r13_r17

RETURN
FUNC_EPILOG
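
The sum loop in the listing above unpacks signed 8-bit correlation values to 16 bits (VUPKHSB/VUPKLSB), multiply-accumulates them against the four 16-bit X vectors (VMSUMSHS), and reduces each accumulator across the vector (VSUMSWS). A minimal scalar sketch of one such sum follows, assuming 32 correlation bytes and 32 16-bit X values per sum, as the 16/32/48 byte indices suggest. The name `corr_dot32` and the single final saturation are illustrative assumptions, not part of the listing.

```c
#include <stdint.h>

/* Hypothetical scalar model of one sum produced by the vector loop:
 * sign-extend 32 int8 correlations and take their dot product with
 * 32 int16 X values. The vector code saturates at each VMSUMSHS and
 * VSUMSWS step; this sketch accumulates in 64 bits and saturates once,
 * which gives the same result whenever no intermediate step saturates. */
static int32_t corr_dot32(const int8_t *corr, const int16_t *x)
{
    int64_t acc = 0;
    for (int k = 0; k < 32; k++)
        acc += (int64_t)corr[k] * (int64_t)x[k]; /* int8 widened, as the unpack does */
    if (acc > INT32_MAX) acc = INT32_MAX;
    if (acc < INT32_MIN) acc = INT32_MIN;
    return (int32_t)acc;
}
```

In this model, sum0 and sum1 would correspond to two consecutive 32-byte rows read through Corr0_bf, and sum2 and sum3 to two rows read through Corr1_bf, all against the same X values.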
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MC Standard Algorithms -- PPC Macro language Version 



File Name: GEN_X_ROW.MAC

Description: 2 Complex scalers (4x1) 2 complex vectors (4xN)
             16 bit complex multiplication producing a 16
             bit complex vector of length 16*N.

Entry/params: GEN_X_ROW (A1, A2, C, Phys_index, N)

Formula:

for ( i = 0; i < tot_phys_users; i++ ) {

    in_mpath1p = mpath1_bf + (i * N_FINGERS_MAX);
    in_mpath2p = mpath2_bf + (i * N_FINGERS_MAX);

    j = 0;
    for ( q1 = 0; q1 < N_FINGERS_MAX; q1++ ) {

        s1r = (BF32)out_mpath1p[q1].real;
        s1i = (BF32)out_mpath1p[q1].imag;
        s2r = (BF32)out_mpath2p[q1].real;
        s2i = (BF32)out_mpath2p[q1].imag;

        for ( q = 0; q < N_FINGERS_MAX; q++ ) {

            a1r = (BF32)in_mpath1p[q].real;
            a1i = (BF32)in_mpath1p[q].imag;
            a2r = (BF32)in_mpath2p[q].real;
            a2i = (BF32)in_mpath2p[q].imag;

            cr  = (a1r * s1r) + (a1i * s1i);
            ci  = (a1r * s1i) - (a1i * s1r);
            cr += (a2r * s2r) + (a2i * s2i);
            ci += (a2r * s2i) - (a2i * s2r);

            X_bf[i * N_FINGERS_MAX_SQUARED + j].real = (BF16)(cr >> 16);
            X_bf[i * N_FINGERS_MAX_SQUARED + j].imag = (BF16)(ci >> 16);

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision  Date    Engineer  Reason
0.0       000907  fpl       Created

#include "salppc.inc"

#define LOG_N_FINGERS_MAX 2
#define LOG_ELEMENT_SIZE 2

#define INDEX_SHIFT (LOG_N_FINGERS_MAX + LOG_ELEMENT_SIZE)

/**

Local read-only Permute vector table
**/



RODATA_SECTION( 6 )
START_L_ARRAY( local_table )

L_PERMUTE_MASK( 0x02031011, 0x06071415, 0x0a0b1819, 0x0e0f1c1d )
/**

32 -> 16 bit: select the 16 MSBs of each 32 bit field
**/

L_PERMUTE_MASK( 0x00011011, 0x04051415, 0x08091819, 0x0c0d1c1d )

END_ARRAY
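
The second permute mask is described as selecting the 16 MSBs of each 32-bit field. A byte-level sketch of how such a control vector behaves follows, assuming the usual AltiVec convention that each control byte indexes the 32-byte concatenation of the two source vectors (big-endian byte order). The names `vperm_model` and `MSB16_MASK` are hypothetical and used only for illustration.

```c
#include <stdint.h>

/* Hypothetical byte-level model of a vector permute: each control byte
 * selects one of the 32 bytes of the concatenation a||b, using its low
 * 5 bits (0..15 pick from a, 16..31 pick from b). */
static void vperm_model(uint8_t out[16], const uint8_t a[16],
                        const uint8_t b[16], const uint8_t ctl[16])
{
    for (int i = 0; i < 16; i++) {
        unsigned sel = ctl[i] & 0x1f;
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Second table entry from the listing, written out byte by byte. */
static const uint8_t MSB16_MASK[16] = {
    0x00, 0x01, 0x10, 0x11, 0x04, 0x05, 0x14, 0x15,
    0x08, 0x09, 0x18, 0x19, 0x0c, 0x0d, 0x1c, 0x1d
};
```

With this mask, bytes 0 and 1 of each 32-bit word of the first source alternate with bytes 0 and 1 of the matching word of the second source, which is consistent with the later comment that the vector "interleaves 16 MSBs of real, imaginary".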

/**

API registers
**/

#define A1 r3
#define A2 r4
#define C r5
#define Phys_index r6
#define N r7
/**

Integer loop registers
**/

#define Cp0 C
#define Cp1 r8
#define sptr1 r8
#define Cp2 r9
#define sptr2 r9
#define Cp3 r10
#define tptr r10
#define cindex r11
#define aindex r12
#define index r12

/**

G4 registers
**/

#define cr00 v0
#define cr01 v1
#define cr02 v2
#define cr03 v3

#define vtmp0 v0
#define vtmp2 v2

#define ci00 v4
#define ci01 v5
#define ci02 v6
#define ci03 v7

#define sr00 v8
#define sr01 v9
#define sr02 v10
#define sr03 v11

#define si00 v12
#define si01 v13
#define si02 v14
#define si03 v15

#define sr10 v16
#define sr11 v17
#define sr12 v18
#define sr13 v19

#define si10 v20
#define si11 v21
#define si12 v22
#define si13 v23

#define c0 v24
#define c1 v24
#define c2 v25
#define c3 v26

#define a00 v27
#define a10 v27
#define a01 v28
#define a11 v29

#define sval v28
#define neg_sval v29

#define vc v30
#define zero v31

/**

Begin code text
**/

FUNC_PROLOG

ENTRY_5( gen_X_row, A1, A2, C, Phys_index, N )

USE_THRU_v31( VRSAVE_COND )
/**
Load up complex scaler

sval = sr0 si0 sr1 si1 sr2 si2 sr3 si3
**/

LA( tptr, local_table, 0 )
VXOR( zero, zero, zero )
LI( index, 0 )

/**

Byte offset into 16 bit complex vector
**/

SLWI( Phys_index, Phys_index, INDEX_SHIFT )
ADD( sptr1, A1, Phys_index )
ADD( sptr2, A2, Phys_index )
/**

Load up first scaler:

if sval = sr0,si0 sr1,si1 sr2,si2 sr3,si3
        = s0      s1      s2      s3

**/

LVX( sval, sptr1, index ) /* read 4 16 bit complex values */
VSUBSHS( neg_sval, zero, sval ) /* negate complex scaler values */
VMRGHW( vtmp0, sval, sval ) /* vtmp0 = s0 s0 s1 s1 */
VMRGLW( vtmp2, sval, sval ) /* vtmp2 = s2 s2 s3 s3 */
VMRGHW( sr00, vtmp0, vtmp0 ) /* sr0 = s0 s0 s0 s0 */
VMRGLW( sr01, vtmp0, vtmp0 ) /* sr1 = s1 s1 s1 s1 */
VMRGHW( sr02, vtmp2, vtmp2 ) /* sr2 = s2 s2 s2 s2 */
VMRGLW( sr03, vtmp2, vtmp2 ) /* sr3 = s3 s3 s3 s3 */

/**

if neg_sval = sr0,si0 sr1,si1 sr2,si2 sr3,si3
after perm:
            = si0,-sr0 si1,-sr1 si2,-sr2 si3,-sr3
            = ns0      ns1      ns2      ns3

**/

LVX( vc, tptr, index )

VPERM( neg_sval, sval, neg_sval, vc ) /* si -sr */
VMRGHW( vtmp0, neg_sval, neg_sval ) /* vtmp0 = ns0 ns0 ns1 ns1 */
VMRGLW( vtmp2, neg_sval, neg_sval ) /* vtmp2 = ns2 ns2 ns3 ns3 */
VMRGHW( si00, vtmp0, vtmp0 ) /* si0 = ns0 ns0 ns0 ns0 */
VMRGLW( si01, vtmp0, vtmp0 ) /* si1 = ns1 ns1 ns1 ns1 */
VMRGHW( si02, vtmp2, vtmp2 ) /* si2 = ns2 ns2 ns2 ns2 */
VMRGLW( si03, vtmp2, vtmp2 ) /* si3 = ns3 ns3 ns3 ns3 */

/**

Load up second scaler:
**/

LVX( sval, sptr2, index ) /* read 4 16 bit complex values */
ADDI( index, index, 16 )
VSUBSHS( neg_sval, zero, sval ) /* negate complex scaler values */
VMRGHW( vtmp0, sval, sval ) /* vtmp0 = s0 s0 s1 s1 */
VMRGLW( vtmp2, sval, sval ) /* vtmp2 = s2 s2 s3 s3 */
VMRGHW( sr10, vtmp0, vtmp0 ) /* sr0 = s0 s0 s0 s0 */
VMRGLW( sr11, vtmp0, vtmp0 ) /* sr1 = s1 s1 s1 s1 */
VMRGHW( sr12, vtmp2, vtmp2 ) /* sr2 = s2 s2 s2 s2 */
VMRGLW( sr13, vtmp2, vtmp2 ) /* sr3 = s3 s3 s3 s3 */

VPERM( neg_sval, sval, neg_sval, vc ) /* si -sr */
VMRGHW( vtmp0, neg_sval, neg_sval ) /* vtmp0 = ns0 ns0 ns1 ns1 */
VMRGLW( vtmp2, neg_sval, neg_sval ) /* vtmp2 = ns2 ns2 ns3 ns3 */
VMRGHW( si10, vtmp0, vtmp0 ) /* si0 = ns0 ns0 ns0 ns0 */
VMRGLW( si11, vtmp0, vtmp0 ) /* si1 = ns1 ns1 ns1 ns1 */
VMRGHW( si12, vtmp2, vtmp2 ) /* si2 = ns2 ns2 ns2 ns2 */
VMRGLW( si13, vtmp2, vtmp2 ) /* si3 = ns3 ns3 ns3 ns3 */

/**

Assign loop pointers and index registers:

Loop permute control vector assumes 16 bit input vectors

C[] -> 16 x N complex elements
A[] -> 4 x N complex elements
N -> 4 byte (i.e. interleaved complex) elements
**/

LVX( vc, tptr, index ) /* interleaves 16 MSBs of real, imaginary */
LI( aindex, 0 )
LI( cindex, 0 )
ADDI( Cp1, C, 16 )
ADDI( Cp2, C, 32 )
ADDI( Cp3, C, 48 )

/**

Start up loop code:

Each read on A[] brings in 4 complex input values
**/

LVX( a00, A1, aindex )
DECR_C( N )
LVX( a01, A2, aindex )
ADDI( aindex, aindex, 16 )

VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
BEQ( do1 )

DECR_C( N )
LVX( a10, A1, aindex ) /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
BR( mid_loop0 )

/**

Top of double loop
**/

LABEL( loop0 )
/* { */

VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
DECR_C( N )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
LVX( a10, A1, aindex ) /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
LABEL( mid_loop0 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc ) /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex ) /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )

/* } */

BNE( loop1 )
/**

Drop out to flush
**/

VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc ) /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex ) /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )

VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )

/**

Top of second loop
**/

LABEL( loop1 )
/* { */

VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
DECR_C( N )
VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )
LVX( a00, A1, aindex ) /* read input for next pass */
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
LVX( a01, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc ) /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex ) /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )

/* } */

BNE( loop0 )
/**

Flush loop

**/

VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc ) /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex ) /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )

VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )

LABEL( do1 )

VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
VMSUMSHS( cr01, sr11, a01, cr01 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc ) /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex ) /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )

/**

Return
**/

LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )

RETURN
FUNC_EPILOG
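
The formula in the GEN_X_ROW header can be sketched in portable C for a single physical user. This is only a reference sketch: the `cplx16` type and the name `gen_x_row_ref` are hypothetical, `N_FINGERS_MAX` is taken as 4 because the listing defines `LOG_N_FINGERS_MAX` as 2, and the `j++` step is an assumption because the header formula is cut off before the end of the inner loop. The sketch uses 64-bit accumulators and does not model the saturating behaviour of the vector code.

```c
#include <stdint.h>

#define N_FINGERS_MAX 4 /* assumption: 1 << LOG_N_FINGERS_MAX */

typedef struct { int16_t real, imag; } cplx16; /* hypothetical 16-bit complex type */

/* Reference sketch for one physical user. a1/a2 are the two input vectors,
 * s1/s2 the two complex scalers; x receives N_FINGERS_MAX * N_FINGERS_MAX
 * outputs. The right shift of a negative value is implementation-defined
 * in C; an arithmetic shift is assumed. */
static void gen_x_row_ref(const cplx16 *a1, const cplx16 *a2,
                          const cplx16 *s1, const cplx16 *s2,
                          cplx16 *x)
{
    int j = 0;
    for (int q1 = 0; q1 < N_FINGERS_MAX; q1++) {
        int64_t s1r = s1[q1].real, s1i = s1[q1].imag;
        int64_t s2r = s2[q1].real, s2i = s2[q1].imag;
        for (int q = 0; q < N_FINGERS_MAX; q++, j++) {
            int64_t a1r = a1[q].real, a1i = a1[q].imag;
            int64_t a2r = a2[q].real, a2i = a2[q].imag;
            int64_t cr = (a1r * s1r) + (a1i * s1i);
            int64_t ci = (a1r * s1i) - (a1i * s1r);
            cr += (a2r * s2r) + (a2i * s2i);
            ci += (a2r * s2i) - (a2i * s2r);
            x[j].real = (int16_t)(cr >> 16); /* keep the 16 MSBs of the 32-bit product sum */
            x[j].imag = (int16_t)(ci >> 16);
        }
    }
}
```

Each output is the sum of two products of the form conj(a) * s, truncated to its upper 16 bits, which matches the "select the 16 MSBs" permute used when the results are stored.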
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#include "mudlib. h" 
/*
 * Return the offset in units of complex elements into the Corr0 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */

int mudlib_get_Corr0_offset(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int num_fingers,          /* typically, 4 */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    num_Corrs = (num_virt_users * tot_virt_users) -
                ((num_virt_users * (num_virt_users + 1)) / 2);

    return ( num_Corrs * (num_fingers * num_fingers) );
}
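
The Corr0 offset arithmetic is consistent with a packed, strictly triangular layout in which virtual user u is followed by T - 1 - u blocks of num_fingers * num_fingers elements, where T is the total number of virtual users. That reading is an interpretation of the formula, not something the listing states. A small sketch, with hypothetical helper names and with k standing for the number of virtual users that precede the starting user, compares the formula with a difference-of-triangles form:

```c
/* Hypothetical helper: k virtual users precede the starting user, T is the
 * total number of virtual users, nf is the number of fingers. Mirrors the
 * arithmetic of the listing: k*T - k*(k+1)/2 blocks of nf*nf elements. */
static int corr0_offset(int k, int T, int nf)
{
    int num_corrs = (k * T) - ((k * (k + 1)) / 2);
    return num_corrs * (nf * nf);
}

/* Equivalent form: ordered pairs in the whole matrix minus the ordered
 * pairs that remain after skipping k users, halved. */
static int corr0_offset_alt(int k, int T, int nf)
{
    int total = T * (T - 1);
    int remaining = (T - k) * (T - k - 1);
    return (nf * nf) * ((total - remaining) >> 1);
}
```

For example, with T = 8 and nf = 4, skipping k = 3 users gives 3*8 - 6 = 18 blocks, or 288 complex elements, from either form.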

/*
 * Return the size (in bytes) of the portion of the Corr0 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of Corr0 are assumed
 * to be of type COMPLEX_BF8.
 */

int mudlib_get_Corr0_size(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int num_fingers,          /* typically, 4 */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int end_phys_user,        /* zero-based index into ptov_map */
    int end_virt_user         /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_Corr0_offset( ptov_map,
                                            num_fingers,
                                            tot_virt_users,
                                            start_phys_user,
                                            start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_Corr0_offset( ptov_map,
                                          num_fingers,
                                          tot_virt_users,
                                          end_phys_user,
                                          end_virt_user );

    return ( (end_offset - start_offset) * sizeof(COMPLEX_BF8) );
}
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/* 

 * Return the offset in units of complex elements into the Corr1 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */

int mudlib_get_Corr1_offset(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int num_fingers,          /* typically, 4 */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;
    num_Corrs = (num_virt_users * tot_virt_users);

    return ( num_Corrs * (num_fingers * num_fingers) );
}

/*
 * Return the size (in bytes) of the portion of the Corr1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of Corr1 are assumed
 * to be of type COMPLEX_BF8.
 */

int mudlib_get_Corr1_size(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int num_fingers,          /* typically, 4 */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int end_phys_user,        /* zero-based index into ptov_map */
    int end_virt_user         /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_Corr1_offset( ptov_map,
                                            num_fingers,
                                            tot_virt_users,
                                            start_phys_user,
                                            start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_Corr1_offset( ptov_map,
                                          num_fingers,
                                          tot_virt_users,
                                          end_phys_user,
                                          end_virt_user );

    return ( (end_offset - start_offset) * sizeof(COMPLEX_BF8) );
}

/*
 * Return the offset into the R0 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */
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int 



mudlib_get_R0_offset(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_virt_users, offset, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    offset = 0;
    for ( i = 0; i < num_virt_users; i++ )
        offset += (tcols - (i & ~R_MATRIX_ALIGN_MASK));
    return offset;
}
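
The R0 offset uses two masking idioms: rounding the column count up to an alignment boundary, and rounding each row's starting column down to that boundary. A minimal sketch follows. The mask value 3 is an assumption chosen only for illustration, since the value of `R_MATRIX_ALIGN_MASK` does not appear in this listing, and the helper names are hypothetical.

```c
#define ALIGN_MASK 3 /* illustrative assumption; stands in for R_MATRIX_ALIGN_MASK */

/* Round n up to the next multiple of (ALIGN_MASK + 1). */
static int round_up(int n)
{
    return (n + ALIGN_MASK) & ~ALIGN_MASK;
}

/* Offset of row k in a packed, aligned triangular layout: row i holds
 * tcols minus the aligned-down start column (i & ~ALIGN_MASK) elements.
 * k is the number of virtual users that precede the starting user. */
static int r0_offset(int k, int tot_virt_users)
{
    int tcols = round_up(tot_virt_users);
    int offset = 0;
    for (int i = 0; i < k; i++)
        offset += tcols - (i & ~ALIGN_MASK);
    return offset;
}
```

With 6 virtual users and this mask, tcols is 8; rows 0 to 3 each occupy 8 elements and row 4 occupies 4, so the offset of row 5 is 36.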



/*
 * Return the size (in bytes) of the portion of the R0 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R0 are assumed
 * to be of type BF8.
 */

int mudlib_get_R0_size(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int end_phys_user,        /* zero-based index into ptov_map */
    int end_virt_user         /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R0_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R0_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}



/*
 * Return the offset into the R1 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */

int mudlib_get_R1_offset(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int num_virt_users, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    return ( num_virt_users * tcols );
}

/*
 * Return the size (in bytes) of the portion of the R1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R1 are assumed
 * to be of type BF8.
 */

int mudlib_get_R1_size(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int end_phys_user,        /* zero-based index into ptov_map */
    int end_virt_user         /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R1_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R1_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}

/*
 * Return the number of virtual users
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive.
 */

int mudlib_get_num_virt_users(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int end_phys_user,        /* zero-based index into ptov_map */
    int end_virt_user         /* must be < ptov_map[end_phys_user] */
)
{
    int i, num_virt_users;

    if ( start_phys_user == end_phys_user )
        return ( end_virt_user - start_virt_user + 1 );
    else {
        num_virt_users = ptov_map[start_phys_user] - start_virt_user;
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            num_virt_users += ptov_map[i];
        num_virt_users += (end_virt_user + 1);
        return ( num_virt_users );
    }
}
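
A worked example may make the inclusive counting convention concrete. The sketch below is a stand-alone copy of the counting logic under a hypothetical name, `count_virt_users`, so that it can be compiled on its own; the map values are made up for illustration.

```c
/* Hypothetical stand-alone copy of the counting logic, for illustration.
 * ptov_map[p] is the number of virtual users of physical user p. */
static int count_virt_users(const unsigned char *ptov_map,
                            int start_phys, int start_virt,
                            int end_phys, int end_virt)
{
    int i, n;
    if (start_phys == end_phys)
        return end_virt - start_virt + 1;
    n = ptov_map[start_phys] - start_virt;   /* rest of the first physical user */
    for (i = start_phys + 1; i < end_phys; i++)
        n += ptov_map[i];                     /* whole physical users in between */
    n += end_virt + 1;                        /* leading part of the last physical user */
    return n;
}
```

With a map of {2, 3, 1}, counting from pair (0, 1) to pair (2, 0) gives 1 + 3 + 1 = 5. The offset routines call the function from pair (0, 0) and subtract 1, which yields the number of virtual users that precede the starting pair.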



/*
 * For a specified starting physical user, virtual user
 * (within the starting physical user) pair and a specified
 * number of virtual users inclusive of the starting pair,
 * return (in separate arguments), the corresponding ending
 * physical user, virtual user pair (inclusive).
 */

void mudlib_get_end_user_pair(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int num_virt_users,       /* number from start (must be > 0) */
    int *end_phys_user,       /* zero-based index into ptov_map */
    int *end_virt_user        /* will be < ptov_map[*end_phys_user] */
)
{
    int i, j;

    for ( i = start_phys_user; ; i++ ) {
        for ( j = start_virt_user; j < ptov_map[i]; j++ )
            if ( --num_virt_users == 0 ) break;
        if ( num_virt_users == 0 ) break;
        start_virt_user = 0;
    }

    *end_phys_user = i;
    *end_virt_user = j;
}
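
The end-pair search walks the physical-to-virtual map and stops once the requested number of virtual users has been consumed. A stand-alone sketch under the hypothetical name `end_user_pair`, with made-up map values, shows the walk:

```c
/* Hypothetical stand-alone copy of the end-pair walk, for illustration.
 * num_virt must be > 0 and must not run past the end of the map. */
static void end_user_pair(const unsigned char *ptov_map,
                          int start_phys, int start_virt, int num_virt,
                          int *end_phys, int *end_virt)
{
    int i, j;
    for (i = start_phys; ; i++) {
        for (j = start_virt; j < ptov_map[i]; j++)
            if (--num_virt == 0) break;   /* consumed the last requested user */
        if (num_virt == 0) break;
        start_virt = 0;                   /* later physical users start at virtual user 0 */
    }
    *end_phys = i;
    *end_virt = j;
}
```

With a map of {2, 3, 1}, starting at pair (0, 1) and asking for 5 virtual users ends at pair (2, 0); asking for 1 returns the starting pair itself.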
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#include "mudlib. h" 

/**********************************************************************
 * Virtual users version
 **********************************************************************/

int mudlib_get_Corr0_offset_v(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int num_fingers,          /* typically, 4 */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_fingers_squared, remaining_size, skipped_virt_users,
        total_size;

    num_fingers_squared = num_fingers * num_fingers;
    skipped_virt_users = 0;

    for ( i = 0; i < start_phys_user; i++ )
        skipped_virt_users += (int)ptov_map[i];

    skipped_virt_users += start_virt_user;

    // Always even
    total_size = tot_virt_users * ( tot_virt_users - 1 );
    remaining_size = ( tot_virt_users - skipped_virt_users )
                     * ( tot_virt_users - skipped_virt_users - 1 );

    // zero based units of complex elements
    return ( num_fingers_squared * ( ( total_size - remaining_size ) >> 1 ) );
}

int mudlib_get_Corr1_offset_v(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int num_fingers,          /* typically, 4 */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_fingers_squared, skipped_virt_users;

    num_fingers_squared = num_fingers * num_fingers;
    skipped_virt_users = 0;

    for ( i = 0; i < start_phys_user; i++ )
        skipped_virt_users += (int)ptov_map[i];

    skipped_virt_users += start_virt_user;

    return ( num_fingers_squared * ( skipped_virt_users * tot_virt_users ) );
}



int mudlib_get_R0_offset_v(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    R0_skipped_virt_users = 0;
    size = 0;

    for ( i = 0; i < start_phys_user; i++ ) {
        for ( iv = 0; iv < (int)ptov_map[i]; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            ++R0_skipped_virt_users;
        }
    }

    /* Handle last physical user, potentially split on virt users */
    for ( iv = 0; iv < (int)start_virt_user; iv++ ) {
        R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
        size += R0_tcols;
        ++R0_skipped_virt_users;
    }

    return size;
}

int mudlib_get_R0_size_v(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int end_phys_user,        /* zero-based index into ptov_map */
    int end_virt_user         /* must be < ptov_map[end_phys_user] */
)
{
    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    R0_skipped_virt_users = 0;
    for ( i = 0; i < start_phys_user; i++ )
        R0_skipped_virt_users += (int)ptov_map[i];
    R0_skipped_virt_users += start_virt_user;
    // printf("skipped: %d\n", R0_skipped_virt_users);

    size = 0;

    if ( start_phys_user == end_phys_user ) {
        // printf("start == end phys\n");
        // <= for Inclusive
        for ( iv = start_virt_user; iv <= (int)end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf("size: %d, R0tc: %d\n", size, R0_tcols);
            ++R0_skipped_virt_users;
        }
    }
    else {
        for ( i = start_phys_user; i < end_phys_user; i++ ) {
            for ( iv = 0; iv < (int)ptov_map[i]; iv++ ) {
                R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
                size += R0_tcols;
                // printf("size: %d, R0tc: %d\n", size, R0_tcols);
                ++R0_skipped_virt_users;
            }
        }

        /* Handle last physical user, potentially split on virt users */
        // printf("last phys user\n");
        // <= for Inclusive
        for ( iv = start_virt_user; iv <= (int)end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf("size: %d, R0tc: %d\n", size, R0_tcols);
            ++R0_skipped_virt_users;
        }
    }

    return size;
}

int mudlib_get_R1_offset_v(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user       /* must be < ptov_map[start_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;
    // Main loop
    for ( i = 0; i < start_phys_user; i++ ) {
        virt_users += (int)ptov_map[i];
    }
    // Trailing virtual users
    virt_users += start_virt_user;

    return ( virt_users * tcols );
}

int mudlib_get_R1_size_v(
    unsigned char *ptov_map,  /* no more than 256 virts. per phys */
    int tot_virt_users,       /* sum of ptov_map over all phys users */
    int start_phys_user,      /* zero-based index into ptov_map */
    int start_virt_user,      /* must be < ptov_map[start_phys_user] */
    int end_phys_user,        /* zero-based index into ptov_map */
    int end_virt_user         /* must be < ptov_map[end_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;

    if ( start_phys_user == end_phys_user )
        virt_users = end_virt_user - start_virt_user + 1;
    else if ( start_phys_user < end_phys_user ) {
        // Leading virtual users
        virt_users = (int)ptov_map[start_phys_user] - start_virt_user;
        // Main loop
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            virt_users += (int)ptov_map[i];
        // Trailing virtual users
        virt_users += (end_virt_user + 1);
    }

    return ( virt_users * tcols );
}
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#define 10 1 
#define TIME 0 



// Asynchronous MPIC 
// 

#if TIME 

#include <tmr.h> 
#endif 

#include "mudlib.h" 

void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n );

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

#if TIME
static int time_count = 0;
static int z;
static float time;
static TMR_ts time0, time1;
static TMR_timespec elapsed;
#endif



/*
 * void async_multirate_mpic
 *      ( BF8 *Bt_hat, BF8 *R0_hat,
 *        BF8 *R1_hat, BF8 *R1m_hat,
 *        BF32 *Y, BF32 Ythresh,
 *        int N_users, int N_bits, int N_stages )
 *
 * N_users must be > 0 and divisible by 4
 * N_bits must be >= 5
 */

void mudlib_mpic( BF8 *Bt_hat,
                  BF8 *R0_hat,
                  BF8 *R1_hat,
                  BF8 *R1m_hat,
                  BF32 *Y,
                  BF32 Ythresh,
                  int N_users,
                  int N_bits,
                  int N_stages )
{
    BF8 *Bt_hatp;
    BF8 *R0_hatp, *R1_hatp, *R1m_hatp;
    BF32 *Yp;
    BF32 R_bias, sums[3];
    int hat_tc, i, m, N_users_pad, stage;

    hat_tc = (N_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    N_users_pad = (N_users + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

#if 0
    if ( ((long)Bt_hat | (long)R0_upper_bf | (long)R0_lower_bf |
          (long)R1_trans_bf | (long)R1m_bf) & ALTIVEC_ALIGN_MASK ) {
        printf( "***** inputs are NON-ALIGNED *****\n" );
        exit( -1 );
    }
#endif
// 

    // Subtract interference in N stages

    for ( stage = 0; stage < N_stages; stage++ ) {

        R0_hatp = R0_hat;
        R1_hatp = R1_hat;
        R1m_hatp = R1m_hat;
        Yp = Y;

        for ( i = 0; i < N_users; i++ ) {

            sve3_8bit( R0_hatp, R1_hatp, R1m_hatp, &R_bias, N_users_pad );

#if 0
            R0_hatp[i] = BF8_ZERO;        /* zero diagonal element */
#endif

            Bt_hatp = Bt_hat + hat_tc;    /* points to leading row */
            m = 2;

            while ( m <= (N_bits-4) ) {
                if ( BFABS( Yp[m] ) < Ythresh ) {
                    if ( BFABS( Yp[m+1] ) < Ythresh ) {
                        if ( BFABS( Yp[m+2] ) < Ythresh ) {
                            dotpr9_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                         sums, N_users_pad, hat_tc );
                            sums[0] -= R_bias;
                            sums[1] -= ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );
                            if ( (Yp[m] - sums[0]) > BF32_ZERO )
                                Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                            sums[1] += ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );
                            sums[1] -= R_bias;
                            sums[2] -= ( (BF32)Bt_hatp[2*hat_tc + i] * (BF32)R1_hatp[i] );
                            if ( (Yp[m+1] - sums[1]) > BF32_ZERO )
                                Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[2*hat_tc + i] = -1 + BIAS_8BIT;
                            sums[2] += ( (BF32)Bt_hatp[2*hat_tc + i] * (BF32)R1_hatp[i] );
                            sums[2] -= R_bias;
                            if ( (Yp[m+2] - sums[2]) > BF32_ZERO )
                                Bt_hatp[3*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[3*hat_tc + i] = -1 + BIAS_8BIT;
                        }
                        else { /* skip third sum */
                            dotpr6_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                         sums, N_users_pad, hat_tc );
                            sums[0] -= R_bias;
                            sums[1] -= ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );
                            if ( (Yp[m] - sums[0]) > BF32_ZERO )
                                Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                            sums[1] += ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );
                            sums[1] -= R_bias;
                            if ( (Yp[m+1] - sums[1]) > BF32_ZERO )
                                Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[2*hat_tc + i] = -1 + BIAS_8BIT;
                        }
#if IO
                        Bt_hatp += hat_tc;    /* bump leading row pointer */
#endif
                        ++m;                  /* bump row */
                    }
                    else { /* skip second sum */
                        dotpr3_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                     sums, N_users_pad, hat_tc );
                        sums[0] -= R_bias;
                        if ( (Yp[m] - sums[0]) > BF32_ZERO )
                            Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                        else
                            Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                    }
#if IO
                    Bt_hatp += hat_tc;        /* bump leading row pointer */
#endif
                    ++m;                      /* bump row */
                }
#if IO
                Bt_hatp += hat_tc;            /* bump leading row pointer */
#endif
                ++m;                          /* bump row */
            }

            /*
             * do last 0, 1 or 2 dot product calculations
             */
            while ( m < (N_bits-2) ) {
                if ( BFABS( Yp[m] ) < Ythresh ) {
                    dotpr3_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                 sums, N_users_pad, hat_tc );
                    sums[0] -= R_bias;
                    if ( (Yp[m] - sums[0]) > BF32_ZERO )
                        Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                    else
                        Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                }
#if IO
                Bt_hatp += hat_tc;            /* bump leading row pointer */
#endif
                ++m;
            }

#if IO
            R0_hatp += hat_tc;                /* bump pointer */
            R1_hatp += hat_tc;                /* bump pointer */
            R1m_hatp += hat_tc;               /* bump pointer */
            Yp += N_bits;                     /* bump pointer */
#endif
        }   /* end of loop over N_users */
    }       /* end of loop over N_stages */
}

#if defined( COMPILE_C )

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int j;

    sums[0] = BF32_ZERO;
    for ( j = 0; j < N; j++ ) {
        sums[0] += (BF32)A[j] * (BF32)B0[j];
        sums[0] += (BF32)A[tcols+j] * (BF32)B1[j];
        sums[0] += (BF32)A[(tcols<<1)+j] * (BF32)B2[j];
    }
}

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 2; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 3; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}

void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n )
{
    int i;
    BF32 wsum;

    wsum = 0;
    for ( i = 0; i < n; i++ ) {
        wsum += (BF32)A[i];
        wsum += (BF32)B[i];
        wsum += (BF32)C[i];
    }
    *sum = wsum;
}
#endif
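
The MPIC loop subtracts R_bias (the element sum produced by sve3_8bit) from every dot product before making a bit decision. A plausible reading is that bit estimates are stored with a +BIAS_8BIT offset, so a dot product over the stored values carries an extra term equal to the sum of the R elements. The sketch below shows that identity for a single row. It is an illustration only: `row_sum`, `biased_dot`, `interference`, and `decide` are hypothetical names, and `BF8` is taken as `signed char` here (the header declares it as plain `char`).

```c
typedef signed char BF8;   /* assumption: signed 8-bit storage */
typedef long BF32;

#define BIAS_8BIT 1

/* Sum of the R elements, the single-row analogue of sve3_8bit. */
static BF32 row_sum(const BF8 *r, int n)
{
    BF32 s = 0;
    int j;
    for (j = 0; j < n; j++)
        s += (BF32)r[j];
    return s;
}

/* Dot product over bits stored as (bit + BIAS_8BIT), i.e. 0 or 2. */
static BF32 biased_dot(const BF8 *b_biased, const BF8 *r, int n)
{
    BF32 s = 0;
    int j;
    for (j = 0; j < n; j++)
        s += (BF32)b_biased[j] * (BF32)r[j];
    return s;
}

/* Removing BIAS_8BIT * sum(r) recovers the dot product over +/-1 bits. */
static BF32 interference(const BF8 *b_biased, const BF8 *r, int n)
{
    return biased_dot(b_biased, r, n) - BIAS_8BIT * row_sum(r, n);
}

/* Hard decision, stored in biased form as in the listing. */
static BF8 decide(BF32 y, BF32 interf)
{
    return (BF8)(((y - interf) > 0 ? 1 : -1) + BIAS_8BIT);
}
```

For bits {+1, -1, +1, -1} stored as {2, 0, 2, 0} and R elements {3, -2, 5, 1}, the biased dot product is 16, the bias term is 7, and the difference 9 equals the unbiased sum.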
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MC Standard Algorithms PPC Macro language Version 



File Name:    GEN_R_MATRICES.MAC

Description:  Float and scale R matrix values, convert to byte.

Entry/params: GEN_R_MATRICES( Rsump, Bf_scalep, Inv_scalep,
                              Scalep, No_scale_row_bfp,
                              Scale_row_bfp, Num_virt_users )

Formula:

    bf_scale = *bf_scalep;
    inv_scale = *inv_scalep;

    for ( i = 0; i < num_virt_users; i++ ) {
        scale = scalep[i];
        fsum = (float)(R_sums[i]);
        fsum *= bf_scale;

        fsum_scale = fsum * inv_scale;
        fsum_scale *= scale;

        SATURATE( fsum_scale )
        SATURATE( fsum )

        no_scale_row_bfp[i] = BF8_FIX( fsum );
        scale_row_bfp[i] = BF8_FIX( fsum_scale );
    }
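
The formula above can be written as portable C for reference. This is a sketch under two assumptions that the listing does not spell out: SATURATE is taken to clamp to the signed 8-bit range, and BF8_FIX is taken to truncate toward zero. The names `saturate8`, `bf8_fix`, and `gen_r_row` are hypothetical.

```c
typedef signed char BF8;

/* Assumption: SATURATE clamps to the signed 8-bit range. */
static float saturate8(float x)
{
    if (x > 127.0f)
        return 127.0f;
    if (x < -128.0f)
        return -128.0f;
    return x;
}

/* Assumption: BF8_FIX truncates toward zero. */
static BF8 bf8_fix(float x)
{
    return (BF8)x;
}

/* Scalar rendering of the GEN_R_MATRICES formula for one row. */
static void gen_r_row(const long *r_sums, float bf_scale, float inv_scale,
                      const float *scalep, BF8 *no_scale_row, BF8 *scale_row,
                      int num_virt_users)
{
    int i;
    for (i = 0; i < num_virt_users; i++) {
        float fsum = (float)r_sums[i] * bf_scale;
        float fsum_scale = fsum * inv_scale * scalep[i];

        no_scale_row[i] = bf8_fix(saturate8(fsum));
        scale_row[i] = bf8_fix(saturate8(fsum_scale));
    }
}
```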



Mercury Computer Systems, Inc. 
Copyright (c) 2000 All rights reserved 



Revision   Date     Engineer   Reason
0.0        000910   fpl        Created
0.1        000914   fpl        Removed VMAXFP and added windin code
0.3        000920   fpl        Removed all windin and windout



#include "salppc.inc"
#define DO_IO 1
#if DO_IO
#define SCALE_BUMP_16 16
#else
#define SCALE_BUMP_16 0
#endif

#define STORE_SCALE( vS, rA, rB ) STVX( vS, rA, rB )
#define ZERO_COND 6
RODATA_SECTION( 6 )
START_L_ARRAY( local_table )
/**
First stage for byte pack
**/
L_PERMUTE_MASK( 0x0004080c, 0x1014181c, 0x0004080c, 0x1014181c )

gen_r_matrices.mac 2/23/2001

/**
Second stage for byte pack
**/
L_PERMUTE_MASK( 0x00010203, 0x04050607, 0x10111213, 0x14151617 )

END_ARRAY 

/**
Input parameters
**/
#define Rsump             r3
#define Bf_scalep         r4
#define Inv_scalep        r5
#define Scalep            r6
#define No_scale_row_bfp  r7
#define Scale_row_bfp     r8
#define Num_virt_users    r9



/**
Local GPRs
**/
#define indx1          r10
#define indx2          r11
#define indx3          r12
#define low4           r0
#define tptr           indx2
#define low4x4         low4

/**
G4 registers
**/
#define zero           v0
#define inv_scale      v1
#define bf_scale       v2
#define byte_pack      v3
#define byte_merge     v4
#define scale0         v5
#define scale1         v6
#define vtmp           scale1
#define scale2         v7
#define vtmp2          scale2
#define scale3         v8
#define fsum0          v9
#define fsum1          v10
#define fsum2          v11
#define fsum3          v12
#define fsum_scale0    v13
#define fsum_scale1    v14
#define fsum_scale2    v15
#define fsum_scale3    v16
#define bsum0          v17
#define bsum1          v18
#define bsum2          v19
#define bsum3          v20
#define bsum_scale0    v21
#define bsum_scale1    v22
#define bsum_scale2    v23
#define bsum_scale3    v24
#define bvector        v25
#define bscale_vector  v26
#define rsum0          v27
#define rsum1          v28
#define rsum2          v29
#define rsum3          v30
#define seven          v31



/** 

Begin code text 
**/ 

FUNC_PROLOG
ENTRY_7( gen_R_matrices, Rsump, Bf_scalep, Inv_scalep, Scalep, \
         No_scale_row_bfp, Scale_row_bfp, Num_virt_users )
CMPWI( Num_virt_users, 0 )
BGT( start )
RETURN
LABEL( start )
USE_THRU_v31( VRSAVE_COND )
/**
Load up permute vectors and loop scalers
**/
LA( tptr, local_table, 0 )
LI( indx1, 16 )
LVX( byte_pack, 0, tptr )
VSPLTISB( seven, 7 )
LVX( byte_merge, tptr, indx1 )
SCALAR_SPLAT( bf_scale, vtmp, Bf_scalep )
SCALAR_SPLAT( inv_scale, vtmp, Inv_scalep )
/**
Back up to nearest 16-byte boundary. It's okay to write before and after to
nearest 16-byte boundary in both directions.
**/
RLWINM( low4, No_scale_row_bfp, 0, 28, 31 )   /* lower 4 bits */
VXOR( zero, zero, zero )
ADD( Num_virt_users, Num_virt_users, low4 )
SUB( No_scale_row_bfp, No_scale_row_bfp, low4 )
SUB( Scale_row_bfp, Scale_row_bfp, low4 )
SLWI( low4x4, low4, 2 )
LI( indx2, 32 )
SUB( Rsump, Rsump, low4x4 )

/**
Start up loop
**/
LVX( rsum0, 0, Rsump )
LI( indx3, 48 )
LVX( rsum1, Rsump, indx1 )
SUB( Scalep, Scalep, low4x4 )
LVX( rsum2, Rsump, indx2 )
VCFSX( fsum0, rsum0, 0 )
LVX( rsum3, Rsump, indx3 )
VCFSX( fsum1, rsum1, 0 )
LVX( scale0, 0, Scalep )
VCFSX( fsum2, rsum2, 0 )
LVX( scale1, Scalep, indx1 )
VCFSX( fsum3, rsum3, 0 )
LVX( scale2, Scalep, indx2 )
VMADDFP( fsum0, fsum0, bf_scale, zero )
LVX( scale3, Scalep, indx3 )
VMADDFP( fsum1, fsum1, bf_scale, zero )
ADDIC_C( Num_virt_users, Num_virt_users, -16 )
VMADDFP( fsum2, fsum2, bf_scale, zero )
VMADDFP( fsum3, fsum3, bf_scale, zero )
VMADDFP( fsum_scale0, fsum0, inv_scale, zero )
VMADDFP( fsum_scale1, fsum1, inv_scale, zero )
VMADDFP( fsum_scale2, fsum2, inv_scale, zero )
ADDI( Rsump, Rsump, 64 )
VMADDFP( fsum_scale3, fsum3, inv_scale, zero )
ADDI( Scalep, Scalep, 64 )
VMADDFP( fsum_scale0, fsum_scale0, scale0, zero )
VMADDFP( fsum_scale1, fsum_scale1, scale1, zero )
VMADDFP( fsum_scale2, fsum_scale2, scale2, zero )
VMADDFP( fsum_scale3, fsum_scale3, scale3, zero )
BLE( sixteen_sums )
LVX( rsum0, 0, Rsump )
LVX( rsum1, Rsump, indx1 )
VCTSXS( bsum0, fsum0, 24 )
LVX( rsum2, Rsump, indx2 )
VCTSXS( bsum1, fsum1, 24 )
VCTSXS( bsum2, fsum2, 24 )
LVX( rsum3, Rsump, indx3 )
ADDI( Rsump, Rsump, 64 )
VCTSXS( bsum3, fsum3, 24 )
LVX( scale0, 0, Scalep )
VCTSXS( bsum_scale0, fsum_scale0, 24 )
VCTSXS( bsum_scale1, fsum_scale1, 24 )
LVX( scale1, Scalep, indx1 )
VCTSXS( bsum_scale2, fsum_scale2, 24 )
LVX( scale2, Scalep, indx2 )
ADDI( No_scale_row_bfp, No_scale_row_bfp, -SCALE_BUMP_16 )
VCTSXS( bsum_scale3, fsum_scale3, 24 )
ADDI( Scale_row_bfp, Scale_row_bfp, -SCALE_BUMP_16 )
BR( mloop )

/**
Top of loop outputs 32 bytes per trip
**/
LABEL( loop )
/* { */
STORE_SCALE( bvector, 0, No_scale_row_bfp )
VCTSXS( bsum_scale3, fsum_scale3, 24 )
STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
LABEL( mloop )
LVX( scale3, Scalep, indx3 )
VCFSX( fsum0, rsum0, 0 )
VPERM( bsum0, bsum0, bsum1, byte_pack )
VCFSX( fsum1, rsum1, 0 )
VCFSX( fsum2, rsum2, 0 )
ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
VCFSX( fsum3, rsum3, 0 )
ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )
VMADDFP( fsum0, fsum0, bf_scale, zero )
VPERM( bsum2, bsum2, bsum3, byte_pack )
VMADDFP( fsum1, fsum1, bf_scale, zero )
VMADDFP( fsum2, fsum2, bf_scale, zero )
VMADDFP( fsum3, fsum3, bf_scale, zero )
VMADDFP( fsum_scale0, fsum0, inv_scale, zero )
VPERM( bvector, bsum0, bsum2, byte_merge )
VMADDFP( fsum_scale1, fsum1, inv_scale, zero )
ADDIC_C( Num_virt_users, Num_virt_users, -16 )
VMADDFP( fsum_scale2, fsum2, inv_scale, zero )
VMADDFP( fsum_scale3, fsum3, inv_scale, zero )
ADDI( Scalep, Scalep, 64 )
VMADDFP( fsum_scale0, fsum_scale0, scale0, zero )
VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
VMADDFP( fsum_scale1, fsum_scale1, scale1, zero )
VMADDFP( fsum_scale2, fsum_scale2, scale2, zero )
VMADDFP( fsum_scale3, fsum_scale3, scale3, zero )
VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
VSRB( vtmp, bvector, seven )
VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
VSRB( vtmp2, bscale_vector, seven )
BLE( loop_flush )
LVX( rsum0, 0, Rsump )
VADDSBS( bvector, bvector, vtmp )
LVX( rsum1, Rsump, indx1 )
VADDSBS( bscale_vector, bscale_vector, vtmp2 )
LVX( rsum2, Rsump, indx2 )
VCTSXS( bsum0, fsum0, 24 )
LVX( rsum3, Rsump, indx3 )
VCTSXS( bsum1, fsum1, 24 )
ADDI( Rsump, Rsump, 64 )
VCTSXS( bsum2, fsum2, 24 )
LVX( scale0, 0, Scalep )
VCTSXS( bsum3, fsum3, 24 )
LVX( scale1, Scalep, indx1 )
VCTSXS( bsum_scale0, fsum_scale0, 24 )
VCTSXS( bsum_scale1, fsum_scale1, 24 )
LVX( scale2, Scalep, indx2 )
VCTSXS( bsum_scale2, fsum_scale2, 24 )
/* } */
BR( loop )

/**
Flush loop
**/
LABEL( loop_flush )
VADDSBS( bvector, bvector, vtmp )
STORE_SCALE( bvector, 0, No_scale_row_bfp )
VADDSBS( bscale_vector, bscale_vector, vtmp2 )
STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )
LABEL( sixteen_sums )
VCTSXS( bsum0, fsum0, 24 )
VCTSXS( bsum1, fsum1, 24 )
VCTSXS( bsum2, fsum2, 24 )
VCTSXS( bsum3, fsum3, 24 )
VCTSXS( bsum_scale0, fsum_scale0, 24 )
VPERM( bsum0, bsum0, bsum1, byte_pack )
VCTSXS( bsum_scale1, fsum_scale1, 24 )
VPERM( bsum2, bsum2, bsum3, byte_pack )
VCTSXS( bsum_scale2, fsum_scale2, 24 )
VPERM( bvector, bsum0, bsum2, byte_merge )
VCTSXS( bsum_scale3, fsum_scale3, 24 )
VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
VSRB( vtmp, bvector, seven )
VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
VADDSBS( bvector, bvector, vtmp )
VSRB( vtmp, bscale_vector, seven )
STORE_SCALE( bvector, 0, No_scale_row_bfp )
VADDSBS( bscale_vector, bscale_vector, vtmp )
STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
/**
Return
**/
LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )

--**************************************************************
--**
--**  Majority Voter Control Logic
--**
--**  Description: This Module serves as a generic majority voter
--**
--**  Author     : Steven Imperiali/Mirza Cifric
--**
--**  Date       : 5-15-2000
--**
--**************************************************************
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
USE STD.TEXTIO.ALL;

ENTITY m_voter IS
  PORT (
    clk_66_pal6 : IN  std_logic;
    reset_0     : IN  std_logic;
    request0_0  : IN  std_logic;
    request1_0  : IN  std_logic;
    request2_0  : IN  std_logic;
    request3_0  : IN  std_logic;
    request4_0  : IN  std_logic;
    healthy0_1  : IN  std_logic;
    healthy1_1  : IN  std_logic;
    healthy2_1  : IN  std_logic;
    healthy3_1  : IN  std_logic;
    healthy4_1  : IN  std_logic;
    voteout_0   : OUT std_logic);
END m_voter;

ARCHITECTURE voter OF m_voter IS
  signal pro: STD_LOGIC_VECTOR (3 downto 0);
  signal against: STD_LOGIC_VECTOR (3 downto 0);
  signal result: STD_LOGIC;
BEGIN

  check_result: process (request0_0, request1_0, request2_0, request3_0,
                         request4_0, healthy0_1, healthy1_1, healthy2_1,
                         healthy3_1, healthy4_1)
    variable pro: STD_LOGIC_VECTOR (3 downto 0);
    variable against: STD_LOGIC_VECTOR (3 downto 0);
    variable solution: STD_LOGIC;
  begin
    pro := "0000";        -- set number of pro voters
    against := "0000";    -- set number of against voters

    -- Get the number of pros
    if (healthy0_1 = '1' and request0_0 = '1') then
      pro := pro + "0001";
    end if;
    if (healthy1_1 = '1' and request1_0 = '1') then
      pro := pro + "0001";
    end if;
    if (healthy2_1 = '1' and request2_0 = '1') then
      pro := pro + "0001";
    end if;
    if (healthy3_1 = '1' and request3_0 = '1') then
      pro := pro + "0001";
    end if;
    if (healthy4_1 = '1' and request4_0 = '1') then
      pro := pro + "0001";
    end if;

    -- Get the number of cons
    if (healthy0_1 = '1' and request0_0 = '0') then
      against := against + "0001";
    end if;
    if (healthy1_1 = '1' and request1_0 = '0') then
      against := against + "0001";
    end if;
    if (healthy2_1 = '1' and request2_0 = '0') then
      against := against + "0001";
    end if;
    if (healthy3_1 = '1' and request3_0 = '0') then
      against := against + "0001";
    end if;
    if (healthy4_1 = '1' and request4_0 = '0') then
      against := against + "0001";
    end if;

    -- final score
    if (pro = "0001" and against < "0001") then
      solution := '1';
    elsif (pro = "0010" and against < "0010") then
      solution := '1';
    elsif (pro = "0011" and against < "0011") then
      solution := '1';
    elsif (pro = "0100" and against < "0011") then
      solution := '1';
    elsif (pro = "0101" and against < "0011") then
      solution := '1';
    else
      solution := '0';
    end if;

    result <= solution;       -- put variable val into signal val
    voteout_0 <= solution;    -- put variable val into signal val
  end process check_result;

  result_latch: process (reset_0, clk_66_pal6)
  begin
    IF (reset_0 = '0') THEN
      voteout_0 <= '1';
    ELSIF rising_edge(clk_66_pal6) THEN
      IF result = '0' THEN
        voteout_0 <= '0';
      END IF;
    END IF;
  END PROCESS;

END voter;
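
The "final score" table in check_result can be cross-checked with a small software model. The sketch below covers only the combinational vote and not the result_latch process. The function name `majority_vote` and the use of `int` arrays for the five healthy and request inputs are assumptions made for illustration.

```c
/* Software model of the check_result vote: returns 1 when the healthy
 * requesters outnumber the healthy non-requesters per the VHDL table. */
static int majority_vote(const int healthy[5], const int request[5])
{
    int pro = 0;
    int against = 0;
    int k;

    for (k = 0; k < 5; k++) {
        if (healthy[k] && request[k])
            pro++;
        if (healthy[k] && !request[k])
            against++;
    }

    if (pro == 1)
        return against < 1;
    if (pro == 2)
        return against < 2;
    if (pro >= 3 && pro <= 5)
        return against < 3;
    return 0;   /* no healthy requester */
}
```

Unhealthy voters are ignored on both sides, so two healthy requesters with no healthy opposition still carry the vote, while a two-to-two tie does not.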






m_voter.vhd 3/9/2001






mudlib.h 2/23/2001 



/****************************************************************************
 *
 * FILENAME: mudlib.h
 *
 * CC NUMBER:
 *
 * ABSTRACT:
 *
 * USAGE:
 *
 * COMMENTS:
 *
 * AUTHOR: M. Vinskus
 * DATE: 18-JUL-2000
 *
 ****************************************************************************/

/* ©MERCURY. COPYRIGHT. H® */

#ifndef _MUDLIB_H
#define _MUDLIB_H

/****************************************************************************
 * INCLUDE FILES
 ****************************************************************************/

#include <sal.h>

/****************************************************************************
 * DEFINED CONSTANTS
 ****************************************************************************/

#define NUM_FINGERS_LOG          2
#define NUM_FINGERS_SQUARED_LOG  (2 * NUM_FINGERS_LOG)
#define NUM_FINGERS              (1 << NUM_FINGERS_LOG)
#define NUM_FINGERS_SQUARED      (1 << NUM_FINGERS_SQUARED_LOG)

#define L1_CACHE_SIZE            32768
#define L1_CACHE_LINE_SIZE       32
#define L1_CACHE_ALIGN_LOG       5
#define L1_CACHE_ALIGN           (1 << L1_CACHE_ALIGN_LOG)
#define L1_CACHE_ALIGN_MASK      (L1_CACHE_ALIGN - 1)

#define R_MATRIX_ALIGN_LOG       5
#define R_MATRIX_ALIGN           (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK      (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG        4
#define ALTIVEC_ALIGN            (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK       (ALTIVEC_ALIGN - 1)

#define BF_CORR_FRAC_BITS        8
#define BF_CORR_FACTOR           ((float)(1 << BF_CORR_FRAC_BITS))

#define BF_MPATH_FRAC_BITS       15    /* this should be dynamic */
#define BF_MPATH_FACTOR          ((float)(1 << BF_MPATH_FRAC_BITS))

#define BF_RSUMS_FRAC_BITS       ((2 * BF_MPATH_FRAC_BITS) - 16 + \
                                  BF_CORR_FRAC_BITS)
#define BF_RSUMS_FACTOR          ((float)(1 << BF_RSUMS_FRAC_BITS))
#define BF_RSUMS_RFACTOR         (1.0 / BF_RSUMS_FACTOR)

#define BF_RY_FRAC_BITS          9     /* 0 <= BF_RY_FRAC_BITS <= 14 */
#define BF_RY_FACTOR             ((float)(1 << BF_RY_FRAC_BITS))
#define BF_RY_RFACTOR            (1.0 / BF_RY_FACTOR)

#define BF_COMBINED_FACTOR       ((float)(1 << \
                                  (BF_RSUMS_FRAC_BITS-BF_RY_FRAC_BITS)))
#define BF_COMBINED_RFACTOR      (1.0 / BF_COMBINED_FACTOR)

#define BF8_ZERO                 0
#define BF8_MAX                  0x7f
#define BF8_RY_ONE               ((BF8)(1 << BF_RY_FRAC_BITS))
#define BF16_RY_ONE              ((BF16)(1 << BF_RY_FRAC_BITS))
#define BF16_RY_MONE             (-BF16_RY_ONE)
#define BF16_ZERO                0
#define BF16_MAX                 0x7fff
#define BF32_ZERO                0
#define BF32_RY_ONE              ((BF32)(1 << BF_RY_FRAC_BITS))
#define BF32_MAX                 0x7fffffff

#define BIAS_8BIT                1

#define BFABS( x )  ( ((x) >= 0) ? (x) : (-(x)) )
#define FABS( f )   ( ((f) >= 0.0) ? (f) : (-(f)) )

/****************************************************************************
 * TYPE DEFINITIONS
 ****************************************************************************/

typedef long  BF32;
typedef short BF16;
typedef char  BF8;

typedef struct {
    BF8 real;
    BF8 imag;
} COMPLEX_BF8;

typedef struct {
    BF16 real;
    BF16 imag;
} COMPLEX_BF16;

typedef struct {
    BF32 real;
    BF32 imag;
} COMPLEX_BF32;

/****************************************************************************
 * MACRO DEFINITIONS
 ****************************************************************************/

/* assumes (-(2.0 ^ 7) - 0.5) < (bf_factor * s) < ((2.0 ^ 7) - 0.5) */

#define SFtoBF8( bf_factor, s ) \
    ((BF8)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF8( bf_factor, v, bfv, n ) \
{ \
    int i; \
    float factor = bf_factor; \
    vsmulx( v, 1, &factor, v, 1, n, 0 ); \
    for ( i = 0; i < n; i++ ) \
        bfv[i] = (v[i] > 0.0) ? (BF8)(v[i] + 0.5) : (BF8)(v[i] - 0.5); \
}

#define SBF8toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF8toF( bf_rfactor, bfv, v, n ) \
{ \
    int i; \
    float rfactor = bf_rfactor; \
    for ( i = 0; i < n; i++ ) \
        v[i] = (float)bfv[i]; \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}

/* assumes (-(2.0 ^ 15) - 0.5) < (bf_factor * s) < ((2.0 ^ 15) - 0.5) */

#define SFtoBF16( bf_factor, s ) \
    ((BF16)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF16( bf_factor, v, bfv, n ) \
{ \
    float factor = bf_factor; \
    vsmulx( (float *)v, 1, &factor, (float *)v, 1, n, 0 ); \
    vfixrx( (float *)v, 1, (BF16 *)bfv, 1, n, 0 ); \
}

#define SBF16toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF16toF( bf_rfactor, bfv, v, n ) \
{ \
    float rfactor = bf_rfactor; \
    vfltx( (short *)bfv, 1, v, 1, n, 0 ); \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}

/* assumes (-(2.0 ^ 31) - 0.5) < (bf_factor * x) < ((2.0 ^ 31) - 0.5) */

#define SFtoBF32( bf_factor, s ) \
    ((BF32)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF32( bf_factor, v, bfv, n ) \
{ \
    float factor = bf_factor; \
    vsmulx( v, 1, &factor, (float *)bfv, 1, n, 0 ); \
    vfixr32x( (float *)bfv, 1, (int *)bfv, 1, n, 0 ); \
}

#define SBF32toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF32toF( bf_rfactor, bfv, v, n ) \
{ \
    float rfactor = bf_rfactor; \
    vflt32x( (int *)bfv, 1, v, 1, n, 0 ); \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}
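
The scalar conversion macros scale a float by a power of two, round to nearest by adding or subtracting 0.5 before the truncating cast, and convert back by multiplying with the reciprocal factor. The sketch below restates SFtoBF8 and SBF8toF as functions for the BF_RY format. The function names `sf_to_bf8` and `sbf8_to_f` are hypothetical, and `BF8` is taken as `signed char` for portability.

```c
typedef signed char BF8;   /* assumption: signed 8-bit storage */

#define BF_RY_FRAC_BITS 9
#define BF_RY_FACTOR    ((float)(1 << BF_RY_FRAC_BITS))
#define BF_RY_RFACTOR   (1.0f / BF_RY_FACTOR)

/* Float to 8-bit block-fixed value, rounding to nearest as in SFtoBF8. */
static BF8 sf_to_bf8(float bf_factor, float s)
{
    return (BF8)(bf_factor * s + ((s > 0.0f) ? 0.5f : -0.5f));
}

/* 8-bit block-fixed value back to float, as in SBF8toF. */
static float sbf8_to_f(float bf_rfactor, BF8 bfs)
{
    return bf_rfactor * (float)bfs;
}
```

With nine fractional bits the scaled value must stay within the 8-bit range, so only inputs of magnitude below roughly 0.25 convert without overflow. An input of 0.1 becomes 51, which converts back to 51/512.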

#define CORR_SFtoBF( s )           SFtoBF8( BF_CORR_FACTOR, s )

#define MPATH_VFtoBF( v, bfv, n )  VFtoBF16( BF_MPATH_FACTOR, v, bfv, ((n)<<1) )

#define BHAT_SFtoBF( s )           ((BF8)((s) + (float)BIAS_8BIT))

#define BHAT_SBFtoF( bfs )         ((float)(bfs) - (float)BIAS_8BIT)

#define BHAT_VFtoBF( v, bfv, n ) \
{ \
    float bias = (float)BIAS_8BIT; \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
    fixpixax( v, 1, bfv, n, 0 ); \
}

#define BHAT_VBFtoF( bfv, v, n ) \
{ \
    float bias = (float)(-BIAS_8BIT); \
    fltpixax( bfv, v, 1, n, 0 ); \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
}

#define RHAT_SFtoBF( s )           SFtoBF8( BF_RY_FACTOR, s )

#define RHAT_SBFtoF( bfs )         SBF8toF( BF_RY_RFACTOR, bfs )

#define RHAT_VFtoBF( v, bfv, n )   VFtoBF8( BF_RY_FACTOR, v, bfv, n )

#define RHAT_VBFtoF( bfv, v, n )   VBF8toF( BF_RY_RFACTOR, bfv, v, n )

#define Y_SFtoBF( s )              SFtoBF32( BF_RY_FACTOR, s )

#define Y_SBFtoF( bfs )            SBF32toF( BF_RY_RFACTOR, bfs )

#define Y_VFtoBF( v, bfv, n )      VFtoBF32( BF_RY_FACTOR, v, bfv, n )

#define Y_VBFtoF( bfv, v, n )      VBF32toF( BF_RY_RFACTOR, bfv, v, n )

#define MUDLIB_DECR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    --virt_user; \
    if ( virt_user < 0 ) { \
        --phys_user; \
        virt_user = ptov_map[phys_user] - 1; \
    } \
}

#define MUDLIB_INCR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    ++virt_user; \
    if ( virt_user == ptov_map[phys_user] ) { \
        ++phys_user; \
        virt_user = 0; \
    } \
}
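
The two stepping macros walk a (physical user, virtual user) pair through the ptov_map table, carrying into the next or previous physical user at a boundary. The sketch below expresses the same steps as functions so the behavior can be checked in isolation. The function names are hypothetical, and like the macros they perform no bounds check on the physical-user index.

```c
/* Advance to the next virtual user, moving to the next physical user
 * when the current one is exhausted. */
static void incr_virt_user(const unsigned char *ptov_map,
                           int *phys_user, int *virt_user)
{
    ++*virt_user;
    if (*virt_user == ptov_map[*phys_user]) {
        ++*phys_user;
        *virt_user = 0;
    }
}

/* Step back to the previous virtual user, moving to the last virtual
 * user of the previous physical user when needed. */
static void decr_virt_user(const unsigned char *ptov_map,
                           int *phys_user, int *virt_user)
{
    --*virt_user;
    if (*virt_user < 0) {
        --*phys_user;
        *virt_user = ptov_map[*phys_user] - 1;
    }
}
```

With a map of {2, 1, 3}, stepping forward from (0, 0) visits (0, 1), (1, 0), and (2, 0) in turn.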

/****************************************************************************
 * PUBLIC FUNCTION PROTOTYPES
 ****************************************************************************/

int mudlib_get_Corr0_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr0_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_Corr1_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_Corr1_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */

);

int mudlib_get_R0_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_R1_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R1_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_num_virt_users(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

void mudlib_get_end_user_pair(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int num_virt_users,        /* number from start (must be > 0) */
    int *end_phys_user,        /* zero-based index into ptov_map */
    int *end_virt_user         /* will be < ptov_map[*end_phys_user] */
);

void mudlib_gen_R(
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF8 *corr_0_bf,    /* adjusted for starting physical user */
    COMPLEX_BF8 *corr_1_bf,    /* adjusted for starting physical user */
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    float *bf_scalep,          /* scalar: always a power of 2 */
    float *inv_scalep,         /* adjusted for starting physical user */
    float *scalep,             /* start at 0'th physical user */
    char *L1_cachep,           /* must be 32-byte aligned */
    BF8 *R0_upper_bf,
    BF8 *R0_lower_bf,
    BF8 *R1_trans_bf,
    BF8 *R1m_bf,
    int tot_phys_users,
    int tot_virt_users,
    int start_phys_user,       /* zero-based ("starting row") */
    int start_virt_user,       /* relative to start_phys_user */
    int end_phys_user,         /* actual number of "rows" to process */
    int end_virt_user          /* relative to end_phys_user */
);

void mudlib_4R_to_3R(
    BF8 *R0_upper_bf,          /* input matrix */
    BF8 *R0_lower_bf,          /* input matrix */
    BF8 *R1_trans_bf,          /* input matrix */
    char *L1_cachep,           /* 32K-byte temp, 32-byte aligned */
    BF8 *R0_bf,                /* output matrix */
    BF8 *R1_bf,                /* output matrix */
    int tot_virt_users
);

void mudlib_mpic( BF8 *Bt_hat,
                  BF8 *R0_hat,
                  BF8 *R1_hat,
                  BF8 *R1m_hat,
                  BF32 *Y,
                  BF32 Ythresh,
                  int N_users,
                  int N_bits,
                  int N_stages );

mudlib_ref ormat_corr ( COMPLEX *in_corr, 

COMPLEX BF8 *corr 0 bf, 
COMPLEX BF8 *corr l_bf, 
int num virt users, 
int numjmultipath ) ; 



void 

void 
/* 

* temp names ( v) 
*/ 

int mudlib get CorrO offset v ( 

unsigned char *ptovjmap, /* no more than 256 virts. per phys */ 

/* typically, 4 */ 

/* sum of ptov__map over all phys users 



fixed_zidotprx { COMPLEX SPLIT *A, int I, COMPLEX SPLIT *B, int J , 
COMPLEX_SPLIT *C, int N, int X ) ; 



int num fingers, 
int tot_virt_users, 



*/ 

int start phys user, 
int start virt user 



/* zero-based index into ptov map */ 
/* must be < ptovjnap [start jhys_user] 



); 



int mudlib_get_Corr1_offset_v (
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_offset_v (
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R0_size_v (
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

int mudlib_get_R1_offset_v (
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
);

int mudlib_get_R1_size_v (
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);

#endif /* _MUDLIB_H */

2/23/2001
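The parameter comments in the header describe ptov_map as a per-physical-user count of virtual users, with tot_virt_users equal to its sum and each start/end virtual index bounded by the matching ptov_map entry. A minimal sketch of those two rules follows. The helper names sum_virt_users and flat_virt_index are hypothetical and are not part of mudlib; the listing does not show how the real offset functions are computed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total virtual users is the sum of ptov_map over
 * the first tot_phys_users physical users. */
static int sum_virt_users(const unsigned char *ptov_map, int tot_phys_users)
{
    assert(ptov_map != NULL);
    int total = 0;
    for (int p = 0; p < tot_phys_users; p++)
        total += ptov_map[p];
    return total;
}

/* Hypothetical helper: zero-based flat index of (phys_user, virt_user),
 * i.e. the number of virtual users that precede it.
 * Returns -1 when the pair violates the documented bounds. */
static int flat_virt_index(const unsigned char *ptov_map, int tot_phys_users,
                           int phys_user, int virt_user)
{
    if (phys_user < 0 || phys_user >= tot_phys_users)
        return -1;
    if (virt_user < 0 || virt_user >= ptov_map[phys_user])
        return -1;              /* must be < ptov_map[phys_user] */
    return sum_virt_users(ptov_map, phys_user) + virt_user;
}
```

A caller could use flat_virt_index to validate a (start_phys_user, start_virt_user) pair before passing it to one of the offset or size routines.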
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#include "mudlib.h" 

#define INDEX_5D_TO_LIN(a0, a1, a2, a3, a4, max_a1, max_a2, max_a3, max_a4) \
    ((a4) + (max_a4) * ((a3) + (max_a3) * ((a2) + (max_a2) * ((a1) \
    + (max_a1) * (a0)))))



void
mudlib_reformat_corr (
    COMPLEX *in_corr,
    COMPLEX_BF8 *corr_0_bf,
    COMPLEX_BF8 *corr_1_bf,
    int num_virt_users,
    int num_fingers )
{
    int i, j, q, q1;

    /* lag-0 correlations: only pairs with j > i are emitted */
    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = (i+1); j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_0_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                        0, i, j, q1, q,
                        num_virt_users,
                        num_virt_users,
                        num_fingers,
                        num_fingers)].real );
                    corr_0_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                        0, i, j, q1, q,
                        num_virt_users,
                        num_virt_users,
                        num_fingers,
                        num_fingers)].imag );
                    ++corr_0_bf;
                }
            }
        }
    }

    /* lag-1 correlations: all (i, j) pairs are emitted */
    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = 0; j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_1_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                        1, i, j, q1, q,
                        num_virt_users,
                        num_virt_users,
                        num_fingers,
                        num_fingers)].real );
                    corr_1_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                        1, i, j, q1, q,
                        num_virt_users,
                        num_virt_users,
                        num_fingers,
                        num_fingers)].imag );
                    ++corr_1_bf;
                }
            }
        }
    }
}
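The INDEX_5D_TO_LIN macro used above flattens a five-dimensional index into a row-major linear offset, with the last index varying fastest. The sketch below restates the macro and compares it with an explicit stride computation; the function index_5d_ref is a hypothetical name introduced only for this illustration.

```c
#include <assert.h>

#define INDEX_5D_TO_LIN(a0, a1, a2, a3, a4, max_a1, max_a2, max_a3, max_a4) \
    ((a4) + (max_a4) * ((a3) + (max_a3) * ((a2) + (max_a2) * ((a1) \
    + (max_a1) * (a0)))))

/* Same mapping written with explicit row-major strides, for comparison. */
static int index_5d_ref(int a0, int a1, int a2, int a3, int a4,
                        int max_a1, int max_a2, int max_a3, int max_a4)
{
    assert(a1 < max_a1 && a2 < max_a2 && a3 < max_a3 && a4 < max_a4);
    int stride3 = max_a4;            /* step of a3 */
    int stride2 = max_a3 * stride3;  /* step of a2 */
    int stride1 = max_a2 * stride2;  /* step of a1 */
    int stride0 = max_a1 * stride1;  /* step of a0 */
    return a0 * stride0 + a1 * stride1 + a2 * stride2 + a3 * stride3 + a4;
}
```

With 3 virtual users and 4 fingers, for example, the lag index a0 advances the offset by 3 * 3 * 4 * 4 = 144 elements, so the lag-1 block of in_corr starts 144 entries after the lag-0 block.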
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#include "mudlib.h" 

void mtrans32_8bit (
    BF8 *A,              /* logically contiguous input 32 x 32 blocks */
    BF8 *C,              /* output blocks separated by 32 * out_tc elements */
    char *L1_cachep,
    int A_ncols,
    int A_nrows,
    int C_tcols
);

void mtriangle_8bit (
    BF8 *A,
    BF8 *C,
    int N
);

void mudlib_4R_to_3R (
    BF8 *R0_upper_bf,    /* input matrix */
    BF8 *R0_lower_bf,    /* input matrix */
    BF8 *R1_trans_bf,    /* input matrix */
    char *L1_cachep,     /* temp: 32K bytes, 32-byte aligned */
    BF8 *R0_bf,          /* output matrix */
    BF8 *R1_bf,          /* output matrix */
    int tot_virt_users
)
{
    BF8 *R0_work;
    int i, nrows, R0_tcols, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    nrows = R_MATRIX_ALIGN;

    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;

        mtrans32_8bit ( R1_trans_bf, R1_bf, L1_cachep, tot_virt_users,
                        nrows, tcols );
        R1_trans_bf += (tcols << R_MATRIX_ALIGN_LOG);
        R1_bf += R_MATRIX_ALIGN;
    }

    R0_work = R0_bf;
    R0_tcols = tcols;
    nrows = R_MATRIX_ALIGN;

    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;

        mtrans32_8bit ( R0_lower_bf, R0_work, L1_cachep, i, nrows, tcols );
        R0_lower_bf += (R0_tcols << R_MATRIX_ALIGN_LOG);
        R0_work += ((tcols << R_MATRIX_ALIGN_LOG) + R_MATRIX_ALIGN);
        R0_tcols -= R_MATRIX_ALIGN;
    }

    mtriangle_8bit ( R0_upper_bf, R0_bf, tot_virt_users );
}

#if COMPILE_C

void mtrans32_8bit (
    BF8 *A,              /* logically contiguous input A_nrows x A_ncols blocks */
    BF8 *C,              /* output blocks separated by 32 * C_tcols elements */
    char *L1_cachep,
    int A_ncols,
    int A_nrows,
    int C_tcols
)
{
    BF8 *Ap, *Cp;
    int A_tcols, C_nrows;
    int i, j;

    (void) L1_cachep;

    A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_nrows = R_MATRIX_ALIGN;

    while ( A_ncols ) {
        if ( A_ncols < C_nrows ) C_nrows = A_ncols;
        Ap = A;
        Cp = C;

        for ( i = 0; i < A_nrows; i++ ) {
            for ( j = 0; j < C_nrows; j++ )
                Cp[j * C_tcols] = Ap[j];

            Ap += A_tcols;
            Cp += 1;
        }

        A += R_MATRIX_ALIGN;                     /* input travels horizontally */
        C += (C_tcols << R_MATRIX_ALIGN_LOG);    /* output travels vertically */
        A_ncols -= C_nrows;
    }
}

void mtriangle_8bit (
    BF8 *A,
    BF8 *C,
    int N
)
{
    int A_counter, A_tcols, altivec_N, C_tcols;
    int i, j;

    A_counter = (N + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_tcols = A_counter + 1;

    altivec_N = (N + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

    for ( i = 0; i < N; i++ ) {
        for ( j = 0; j < altivec_N; j++ )
            C[j] = A[j];

        --altivec_N;
        --A_counter;

        A_tcols = (A_counter + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
        A += (A_tcols + 1);

        C += C_tcols;
    }
}

#endif /* COMPILE_C */
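Both reference routines above pad row lengths with the expression (n + MASK) & ~MASK, which rounds n up to the next multiple of a power-of-two alignment. The sketch below isolates that arithmetic for the 32-element case. The helper names round_up_32 and tiles_32 are hypothetical and exist only for this illustration; the constants repeat the values defined in the macro files.

```c
#include <assert.h>

#define R_MATRIX_ALIGN_LOG  5
#define R_MATRIX_ALIGN      (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK (R_MATRIX_ALIGN - 1)

/* Round n up to the next multiple of 32 (the padded row length "tcols"). */
static int round_up_32(int n)
{
    assert(n >= 0);
    return (n + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
}

/* Number of 32-wide tiles needed to cover n columns. */
static int tiles_32(int n)
{
    return round_up_32(n) >> R_MATRIX_ALIGN_LOG;
}
```

For 100 virtual users this gives a padded row length of 128 and four 32-wide tiles, which is why the loops above step by R_MATRIX_ALIGN and clamp nrows on the final, partial tile.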
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MC Standard Algorithms PPC Macro Language Version 



File Name:    mtrans32_8bit.mac

Description:  Perform N_tiles 32 x 32 byte transposes

    void mtrans32_8bit (
        BF8 *A,            contiguous input 32 x 32 blocks
        BF8 *C,            output blocks separated by
                           32 * out_tc elements
        char *L1_cache,
        int A_ncols,
        int A_nrows,
        int C_tcols
    )
    {
        BF8 *Ap, *Cp;
        int A_tcols, C_nrows;
        int i, j;

        A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) &
                  ~R_MATRIX_ALIGN_MASK;
        C_nrows = R_MATRIX_ALIGN;

        while ( A_ncols ) {
            if ( A_ncols < C_nrows ) C_nrows = A_ncols;
            Ap = A;
            Cp = C;

            for ( i = 0; i < A_nrows; i++ ) {
                for ( j = 0; j < C_nrows; j++ )
                    Cp[j * C_tcols] = Ap[j];
                Ap += A_tcols;
                Cp += 1;
            }

            A += R_MATRIX_ALIGN;
            C += (C_tcols << R_MATRIX_ALIGN_LOG);
            A_ncols -= C_nrows;
        }
    }

Restrictions: A, C and L1_cache must all be 16-byte aligned.
              C_tcols must be a multiple of 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000913   fpl        Created



#include "salppc.inc" 

#define DO_PREFETCH 1

#define LOAD_INPUT( vT, rA, rB )     LVXL( vT, rA, rB )
#define LOAD_CACHE( vT, rA, rB )     LVX( vT, rA, rB )

#define STORE_CACHE( vS, rA, rB )    STVX( vS, rA, rB )
#define STORE_OUTPUT( vS, rA, rB )   STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG    5
#define R_MATRIX_ALIGN        (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK   (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG     4
#define ALTIVEC_ALIGN         (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK    (ALTIVEC_ALIGN - 1)

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM, DST_BUMP ) \
    DSTT( rA, rB, STRM ) \
    ADD( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM, DST_BUMP )
#endif

/** 

Four permute vectors for output stage 
**/ 

RODATA_SECTION ( 5 ) 
START_L_ARRAY ( local_table ) 

L_PERMUTE_MASK( 0x00010405, 0x08090c0d, 0x10111415, 0x18191c1d )
L_PERMUTE_MASK( 0x02030607, 0x0a0b0e0f, 0x12131617, 0x1a1b1e1f )
L_PERMUTE_MASK( 0x00020406, 0x080a0c0e, 0x10121416, 0x181a1c1e )
L_PERMUTE_MASK( 0x01030507, 0x090b0d0f, 0x11131517, 0x191b1d1f )

END_ARRAY 
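Each entry in the table above is a control vector for the AltiVec vperm instruction, in which every control byte selects one byte from the 32-byte concatenation of two source vectors. The scalar model below illustrates that selection for the first entry. It assumes that L_PERMUTE_MASK stores its four 32-bit words in big-endian byte order; the names vperm_model and vp0_bytes are hypothetical and do not appear in the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vperm: control byte i picks byte (ctl[i] & 0x1f) from
 * the concatenation of a (indices 0-15) and b (indices 16-31). */
static void vperm_model(uint8_t out[16], const uint8_t a[16],
                        const uint8_t b[16], const uint8_t ctl[16])
{
    assert(out != a && out != b);       /* the model does not allow aliasing */
    for (int i = 0; i < 16; i++) {
        uint8_t sel = ctl[i] & 0x1f;    /* only the low 5 bits are used */
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* First permute vector of the table, written out byte by byte
 * (assumed big-endian word layout). */
static const uint8_t vp0_bytes[16] = {
    0x00, 0x01, 0x04, 0x05, 0x08, 0x09, 0x0c, 0x0d,
    0x10, 0x11, 0x14, 0x15, 0x18, 0x19, 0x1c, 0x1d
};
```

Read this way, the first vector gathers bytes 0-1 of every 4-byte group from both sources, and the second gathers bytes 2-3, which is consistent with its use in the output stage of the transpose.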

/**
    Input parameters
**/

#define A          r3
#define C          r4
#define L1_cache   r5
#define NC         r6
#define NR         r7
#define TCC        r8

#define NC_left    NC
#define TCA        r9
#define TCA4       r10
#define icount     r11
#define aptr0      r12
#define aptr1      r13
#define aptr2      r14
#define aptr3      r15
#define aindx0     r16
#define aindx1     r17
#define aindx2     r18
#define aindx3     r19
#define cptr0      r20
#define cptr1      r21
#define cptr2      r22
#define cptr3      r23
#define cindx0     r24
#define cindx1     r25
#define cindx2     r26
#define cindx3     r27
#define cindx4     aindx0
#define cindx5     aindx1
#define cindx6     aindx2
#define cindx7     aindx3

#define out_indx0  aptr0
#define out_indx1  aptr1
#define out_indx2  aptr2
#define out_indx3  aptr3

#define cptr       cptr0
#define outptr0    cptr1
#define outptr1    cptr2
#define TCC4       cptr3
#define tptr       icount
#define temp       aptr3
#define Cbump      r0
#define dstp       r0

#define dst_code   r28

/**
    G4 registers
**/

#define a00   v0
#define a01   v1
#define a02   v2
#define a03   v3

#define a10   v4
#define a11   v5
#define a12   v6
#define a13   v7

#define a20   v8
#define a21   v9
#define a22   v10
#define a23   v11

#define a30   v12
#define a31   v13
#define a32   v14
#define a33   v15

#define c00   v16
#define c01   v17
#define c02   v18
#define c03   v19

#define c10   v20
#define c11   v21
#define c12   v22
#define c13   v23

#define c20   c00
#define c21   c01
#define c22   c02
#define c23   c03

#define c30   c10
#define c31   c11
#define c32   c12
#define c33   c13

#define vt0   v24
#define vt1   v25
#define vt2   v26
#define vt3   v27

#define vt4   c00
#define vt5   c01
#define vt6   c02
#define vt7   c03

#define vp0   v28
#define vp1   v29
#define vp2   v30
#define vp3   v31

#define c0    a00
#define c1    a01
#define c2    a02
#define c3    a03
#define c4    a10
#define c5    a11
#define c6    a12
#define c7    a13

#define out0  a20
#define out1  a21
#define out2  a22
#define out3  a23
#define out4  a30
#define out5  a31
#define out6  a32
#define out7  a33



/**
    Text begins
**/

FUNC_PROLOG
ENTRY_5( mtrans32_8bit, A, C, L1_cache, N, TCC )
SAVE_r13_r28
USE_THRU_v31( VRSAVE_COND )

ADDI( TCA, NC, R_MATRIX_ALIGN_MASK )
CMPWI( NC_left, 32 )
RLWINM( TCA, TCA, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
LA( tptr, local_table, 0 )
MAKE_STREAM_CODE_IIR( dst_code, 64, 4, TCA )
LVX( vp0, 0, tptr )
ADDI( tptr, tptr, 16 )
LVX( vp1, 0, tptr )
ADDI( tptr, tptr, 16 )
XORI( temp, A, 32 )
LVX( vp2, 0, tptr )
ADDI( tptr, tptr, 16 )
SLWI( TCA4, TCA, 2 )
LVX( vp3, 0, tptr )
BLE( cont )
ANDI_C( temp, temp, 32 )
BR( cont )

/**
    Outer loop transposes 2 (or 1 at end) 32 x 32 tiles per trip
**/
LABEL( outer_loop )
/* { */
CMPWI( NC_left, 32 )
LABEL( cont )
ADD( dstp, A, TCA4 ) /* start prefetch advanced */
MR( aptr0, A )
ADD( dstp, dstp, TCA ) /* advanced further */
LI( aindx0, 0 )
ADD( aptr1, aptr0, TCA )
LI( aindx1, 16 )
ADD( aptr2, aptr1, TCA )
MR( cptr0, L1_cache )
ADD( aptr3, aptr2, TCA )
ADDI( cptr1, cptr0, 512 )
LI( cindx0, 0 )
LOAD_INPUT( a00, aptr0, aindx0 ) /*** begins next sequence ***/
LI( cindx1, 128 )
LOAD_INPUT( a10, aptr1, aindx0 )
LI( cindx2, 256 )
LOAD_INPUT( a20, aptr2, aindx0 )
LI( cindx3, 384 )
LOAD_INPUT( a30, aptr3, aindx0 )
MR( icount, NR )
BLE( input_loop_do1 )
LI( aindx2, 32 ) /* these are used only in two tile loop */
LOAD_INPUT( a02, aptr0, aindx2 )
LI( aindx3, 48 )
LOAD_INPUT( a12, aptr1, aindx2 )
ADDI( cptr2, cptr1, 512 )
LOAD_INPUT( a22, aptr2, aindx2 )
ADDI( cptr3, cptr2, 512 )
LOAD_INPUT( a32, aptr3, aindx2 )

/**
    Top of input loop processes a 4 x 64 byte tile each trip
**/
LABEL( input_loop_do2 )
/* { */
PREFETCH( dstp, dst_code, 0, TCA4 )
ADDIC_C( icount, icount, -4 )
VMRGHW( vt0, a00, a20 ) /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
LOAD_INPUT( a01, aptr0, aindx1 )
VMRGLW( vt2, a00, a20 ) /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
LOAD_INPUT( a11, aptr1, aindx1 )
VMRGHW( vt1, a10, a30 ) /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
LOAD_INPUT( a21, aptr2, aindx1 )
VMRGLW( vt3, a10, a30 ) /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
LOAD_INPUT( a31, aptr3, aindx1 )
VMRGHW( c00, vt0, vt1 ) /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
STORE_CACHE( c00, cptr0, cindx0 )
VMRGLW( c01, vt0, vt1 ) /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
STORE_CACHE( c01, cptr0, cindx1 )
VMRGHW( c02, vt2, vt3 ) /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
STORE_CACHE( c02, cptr0, cindx2 )
VMRGLW( c03, vt2, vt3 ) /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
STORE_CACHE( c03, cptr0, cindx3 )

VMRGHW( vt0, a01, a21 ) /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
LOAD_INPUT( a03, aptr0, aindx3 )
VMRGLW( vt2, a01, a21 ) /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
LOAD_INPUT( a13, aptr1, aindx3 )
VMRGHW( vt1, a11, a31 ) /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
LOAD_INPUT( a23, aptr2, aindx3 )
VMRGLW( vt3, a11, a31 ) /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
LOAD_INPUT( a33, aptr3, aindx3 )
VMRGHW( c10, vt0, vt1 ) /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW( c11, vt0, vt1 ) /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW( c12, vt2, vt3 ) /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW( c13, vt2, vt3 ) /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )
BLE( flush_input_loop_do2 )
ADD( aindx0, aindx0, TCA4 ) /* bump for next load sequence */
ADD( aindx1, aindx1, TCA4 )
ADD( aindx2, aindx2, TCA4 )
ADD( aindx3, aindx3, TCA4 )

VMRGHW( vt0, a02, a22 ) /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
LOAD_INPUT( a00, aptr0, aindx0 ) /*** begins next sequence ***/
VMRGLW( vt2, a02, a22 ) /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
LOAD_INPUT( a02, aptr0, aindx2 )
VMRGHW( vt1, a12, a32 ) /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
LOAD_INPUT( a10, aptr1, aindx0 )
VMRGLW( vt3, a12, a32 ) /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */
LOAD_INPUT( a12, aptr1, aindx2 )
VMRGHW( c20, vt0, vt1 ) /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
STORE_CACHE( c20, cptr2, cindx0 )
VMRGLW( c21, vt0, vt1 ) /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
STORE_CACHE( c21, cptr2, cindx1 )
VMRGHW( c22, vt2, vt3 ) /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
STORE_CACHE( c22, cptr2, cindx2 )
VMRGLW( c23, vt2, vt3 ) /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
STORE_CACHE( c23, cptr2, cindx3 )
VMRGHW( vt0, a03, a23 ) /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
LOAD_INPUT( a20, aptr2, aindx0 )
VMRGLW( vt2, a03, a23 ) /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
LOAD_INPUT( a22, aptr2, aindx2 )
VMRGHW( vt1, a13, a33 ) /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
LOAD_INPUT( a30, aptr3, aindx0 )
VMRGLW( vt3, a13, a33 ) /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */
LOAD_INPUT( a32, aptr3, aindx2 )

VMRGHW( c30, vt0, vt1 ) /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
STORE_CACHE( c30, cptr3, cindx0 )
VMRGLW( c31, vt0, vt1 ) /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
STORE_CACHE( c31, cptr3, cindx1 )
VMRGHW( c32, vt2, vt3 ) /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
STORE_CACHE( c32, cptr3, cindx2 )
VMRGLW( c33, vt2, vt3 ) /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
STORE_CACHE( c33, cptr3, cindx3 )

ADDI( cindx0, cindx0, 16 ) /* bump for next store sequence */
ADDI( cindx1, cindx1, 16 )
ADDI( cindx2, cindx2, 16 )
ADDI( cindx3, cindx3, 16 )

BR( input_loop_do2 )
LABEL( flush_input_loop_do2 )

VMRGHW( vt0, a02, a22 ) /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
VMRGLW( vt2, a02, a22 ) /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
VMRGHW( vt1, a12, a32 ) /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
VMRGLW( vt3, a12, a32 ) /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */
VMRGHW( c20, vt0, vt1 ) /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
STORE_CACHE( c20, cptr2, cindx0 )
VMRGLW( c21, vt0, vt1 ) /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
STORE_CACHE( c21, cptr2, cindx1 )
VMRGHW( c22, vt2, vt3 ) /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
STORE_CACHE( c22, cptr2, cindx2 )
VMRGLW( c23, vt2, vt3 ) /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
STORE_CACHE( c23, cptr2, cindx3 )
VMRGHW( vt0, a03, a23 ) /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
VMRGLW( vt2, a03, a23 ) /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
VMRGHW( vt1, a13, a33 ) /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
VMRGLW( vt3, a13, a33 ) /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */
VMRGHW( c30, vt0, vt1 ) /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
STORE_CACHE( c30, cptr3, cindx0 )
VMRGLW( c31, vt0, vt1 ) /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
STORE_CACHE( c31, cptr3, cindx1 )
VMRGHW( c32, vt2, vt3 ) /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
STORE_CACHE( c32, cptr3, cindx2 )
VMRGLW( c33, vt2, vt3 ) /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
STORE_CACHE( c33, cptr3, cindx3 )

MR( outptr0, C ) /* set for output loop in current pass */
SLWI( Cbump, TCC, 6 )
ADDI( A, A, 64 )
ADD( C, C, Cbump ) /* bump C for next pass */
LI( icount, 64 ) /* set icount for 2 tiles */
BR( output_start ) /* join to common output loop */

/**
    Top of input loop processes a 4 x 32 byte tile each trip
**/
LABEL( input_loop_do1 )
/* { */
PREFETCH( dstp, dst_code, 0, TCA4 )
ADDIC_C( icount, icount, -4 )
VMRGHW( vt0, a00, a20 ) /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
LOAD_INPUT( a01, aptr0, aindx1 )
VMRGLW( vt2, a00, a20 ) /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
LOAD_INPUT( a11, aptr1, aindx1 )
VMRGHW( vt1, a10, a30 ) /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
LOAD_INPUT( a21, aptr2, aindx1 )
VMRGLW( vt3, a10, a30 ) /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
LOAD_INPUT( a31, aptr3, aindx1 )
VMRGHW( c00, vt0, vt1 ) /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
STORE_CACHE( c00, cptr0, cindx0 )
VMRGLW( c01, vt0, vt1 ) /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
STORE_CACHE( c01, cptr0, cindx1 )
VMRGHW( c02, vt2, vt3 ) /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
STORE_CACHE( c02, cptr0, cindx2 )
VMRGLW( c03, vt2, vt3 ) /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
STORE_CACHE( c03, cptr0, cindx3 )
BLE( flush_input_loop_do1 )
ADD( aindx0, aindx0, TCA4 ) /* bump for next load sequence */
ADD( aindx1, aindx1, TCA4 )
VMRGHW( vt0, a01, a21 ) /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
LOAD_INPUT( a00, aptr0, aindx0 ) /*** begins next sequence ***/
VMRGLW( vt2, a01, a21 ) /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
LOAD_INPUT( a10, aptr1, aindx0 )
VMRGHW( vt1, a11, a31 ) /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
LOAD_INPUT( a20, aptr2, aindx0 )
VMRGLW( vt3, a11, a31 ) /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
LOAD_INPUT( a30, aptr3, aindx0 )
VMRGHW( c10, vt0, vt1 ) /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW( c11, vt0, vt1 ) /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW( c12, vt2, vt3 ) /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW( c13, vt2, vt3 ) /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )
ADDI( cindx0, cindx0, 16 ) /* bump for next store sequence */
ADDI( cindx1, cindx1, 16 )
ADDI( cindx2, cindx2, 16 )
ADDI( cindx3, cindx3, 16 )
BR( input_loop_do1 )

LABEL( flush_input_loop_do1 )

VMRGHW( vt0, a01, a21 ) /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
VMRGLW( vt2, a01, a21 ) /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
VMRGHW( vt1, a11, a31 ) /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
VMRGLW( vt3, a11, a31 ) /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
VMRGHW( c10, vt0, vt1 ) /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW( c11, vt0, vt1 ) /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW( c12, vt2, vt3 ) /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW( c13, vt2, vt3 ) /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )

MR( outptr0, C ) /* set for output loop in current pass */
SLWI( Cbump, TCC, 5 )
ADDI( A, A, 32 )
ADD( C, C, Cbump ) /* bump C for next pass */
LI( icount, 32 ) /* set icount for 1 tile */



/**
    Second stage of transposition, write output
**/
LABEL( output_start )
CMPW_CR( 6, icount, NC_left )
MR( cptr, L1_cache )
SLWI( TCC4, TCC, 2 )
LI( cindx0, 0 )
LI( cindx1, 16 )
LI( cindx2, 2*16 )
LI( cindx3, 3*16 )
LI( cindx4, 4*16 )
LI( cindx5, 5*16 )
LI( cindx6, 6*16 )
BLE_CR( 6, PC_OFFSET( 8 ) )
MR( icount, NC_left )
LI( cindx7, 7*16 )
SUB( NC_left, NC_left, icount )
ADDIC_C( icount, icount, -4 )
LI( out_indx0, 0 )
LOAD_CACHE( c0, cptr, cindx0 )
ADD( out_indx1, out_indx0, TCC )
LOAD_CACHE( c1, cptr, cindx1 )
ADD( out_indx2, out_indx1, TCC )
LOAD_CACHE( c2, cptr, cindx2 )
ADD( out_indx3, out_indx2, TCC )
LOAD_CACHE( c3, cptr, cindx3 )
ADDI( outptr1, outptr0, 16 )
LOAD_CACHE( c4, cptr, cindx4 )
VPERM( vt0, c0, c1, vp0 )
LOAD_CACHE( c5, cptr, cindx5 )
VPERM( vt1, c0, c1, vp1 )
LOAD_CACHE( c6, cptr, cindx6 )
VPERM( vt2, c2, c3, vp0 )
LOAD_CACHE( c7, cptr, cindx7 )
VPERM( vt3, c2, c3, vp1 )
ADDI( cptr, cptr, 128 )
BR( output_mloop )

/**
    Loop outputs four 32 byte rows
**/
LABEL( output_loop )
ADDIC_C( icount, icount, -4 )
ADDI( cptr, cptr, 128 )

STORE_OUTPUT( out0, outptr0, out_indx0 )
VPERM( out4, vt4, vt6, vp2 )
STORE_OUTPUT( out4, outptr1, out_indx0 )
VPERM( out5, vt4, vt6, vp3 )
STORE_OUTPUT( out1, outptr0, out_indx1 )
VPERM( out6, vt5, vt7, vp2 )
STORE_OUTPUT( out5, outptr1, out_indx1 )
VPERM( out7, vt5, vt7, vp3 )

STORE_OUTPUT( out2, outptr0, out_indx2 )
VPERM( vt0, c0, c1, vp0 )
STORE_OUTPUT( out6, outptr1, out_indx2 )
VPERM( vt1, c0, c1, vp1 )
STORE_OUTPUT( out3, outptr0, out_indx3 )
VPERM( vt2, c2, c3, vp0 )
STORE_OUTPUT( out7, outptr1, out_indx3 )
VPERM( vt3, c2, c3, vp1 )

ADD( outptr0, outptr0, TCC4 )
ADD( outptr1, outptr1, TCC4 )

LABEL( output_mloop )
BLE( flush_output_loop )
LOAD_CACHE( c0, cptr, cindx0 )
VPERM( vt4, c4, c5, vp0 )
LOAD_CACHE( c1, cptr, cindx1 )
VPERM( vt5, c4, c5, vp1 )
LOAD_CACHE( c2, cptr, cindx2 )
VPERM( vt6, c6, c7, vp0 )
LOAD_CACHE( c3, cptr, cindx3 )
VPERM( vt7, c6, c7, vp1 )
LOAD_CACHE( c4, cptr, cindx4 )
VPERM( out0, vt0, vt2, vp2 )
LOAD_CACHE( c5, cptr, cindx5 )
VPERM( out1, vt0, vt2, vp3 )
LOAD_CACHE( c6, cptr, cindx6 )
VPERM( out2, vt1, vt3, vp2 )
LOAD_CACHE( c7, cptr, cindx7 )
VPERM( out3, vt1, vt3, vp3 )
BR( output_loop )

LABEL( flush_output_loop )
VPERM( vt4, c4, c5, vp0 )
VPERM( vt5, c4, c5, vp1 )
VPERM( vt6, c6, c7, vp0 )
VPERM( vt7, c6, c7, vp1 )

CMPWI( icount, -3 )
VPERM( out0, vt0, vt2, vp2 )
STORE_OUTPUT( out0, outptr0, out_indx0 )
VPERM( out4, vt4, vt6, vp2 )
STORE_OUTPUT( out4, outptr1, out_indx0 )
BEQ( oloop_next )

CMPWI( icount, -2 )
VPERM( out1, vt0, vt2, vp3 )
STORE_OUTPUT( out1, outptr0, out_indx1 )
VPERM( out5, vt4, vt6, vp3 )
STORE_OUTPUT( out5, outptr1, out_indx1 )
BEQ( oloop_next )

CMPWI( icount, -1 )
VPERM( out2, vt1, vt3, vp2 )
STORE_OUTPUT( out2, outptr0, out_indx2 )
VPERM( out6, vt5, vt7, vp2 )
STORE_OUTPUT( out6, outptr1, out_indx2 )
BEQ( oloop_next )

VPERM( out3, vt1, vt3, vp3 )
STORE_OUTPUT( out3, outptr0, out_indx3 )
VPERM( out7, vt5, vt7, vp3 )
STORE_OUTPUT( out7, outptr1, out_indx3 )

/**
    Next four rows of C?
**/
LABEL( oloop_next )
BLT_CR( 6, outer_loop ) /* branch if icount < NC_left */

/**
    Exit routine
**/
LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
REST_r13_r28
RETURN

FUNC_EPILOG
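The input loops above transpose 4 x 4 blocks of 32-bit words with two stages of vmrghw/vmrglw merges, as the register comments spell out. The scalar model below reproduces that two-stage pattern so the data movement can be checked on a small example. The type vec4w and the functions mrghw, mrglw and transpose4x4 are hypothetical names for this sketch; they model the instructions' word ordering and are not taken from the listing.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t w[4]; } vec4w;   /* model of one 16-byte vector */

/* vmrghw model: interleave the two high (first) words of a and b. */
static vec4w mrghw(vec4w a, vec4w b)
{
    vec4w r = {{ a.w[0], b.w[0], a.w[1], b.w[1] }};
    return r;
}

/* vmrglw model: interleave the two low (last) words of a and b. */
static vec4w mrglw(vec4w a, vec4w b)
{
    vec4w r = {{ a.w[2], b.w[2], a.w[3], b.w[3] }};
    return r;
}

/* Transpose a 4 x 4 block of words with two merge stages, in the same
 * order as the listing: rows 0/2 and 1/3 first, then the intermediates. */
static void transpose4x4(vec4w out[4], const vec4w in[4])
{
    assert(out != in);
    vec4w t0 = mrghw(in[0], in[2]);
    vec4w t2 = mrglw(in[0], in[2]);
    vec4w t1 = mrghw(in[1], in[3]);
    vec4w t3 = mrglw(in[1], in[3]);
    out[0] = mrghw(t0, t1);   /* word 0 of every input row */
    out[1] = mrglw(t0, t1);   /* word 1 of every input row */
    out[2] = mrghw(t2, t3);   /* word 2 of every input row */
    out[3] = mrglw(t2, t3);   /* word 3 of every input row */
}
```

This stage only transposes at word granularity; in the listing the byte-level rearrangement inside each word is left to the vperm output stage.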
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MC Standard Algorithms — PPC Macro language Version 



File Name:    mtriangle_8bit.mac

Description:  Move from an upper triangular matrix stored
              as a series of 32-line rectangles, each of
              width 32 elements less than its immediate
              predecessor, to the upper triangle of a
              full N x N matrix.

              mtriangle_8bit ( char *A, char *C, int N )

Restrictions: A, B and C must all be 16-byte aligned.
              N must be a multiple of 16 and >= 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000605   jg         Created

-*/



#include "salppc.inc"

#define LOAD_A( vT, rA, rB )     LVXL( vT, rA, rB )
#define LOAD_C( vT, rA, rB )     LVX( vT, rA, rB )
#define STORE_C( vS, rA, rB )    STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG    5
#define R_MATRIX_ALIGN        (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK   (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG     4
#define ALTIVEC_ALIGN         (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK    (ALTIVEC_ALIGN - 1)

/**
    Input parameters
**/

#define A           r3
#define C           r4
#define N           r5

#define A_tcols     r6
#define C_tcols     r7
#define altivec_N   r8
#define A_counter   r9
#define index0      r10
#define index1      r11
#define index2      r12
#define index3      r13
#define count       r0

#define a0          v0
#define a1          v1
#define a2          v2
#define a3          v3
#define c0          v4
#define shift       v5
#define shift_incr  v6
#define mask        v7
#define left        v8
#define right       v9
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FUNC_PROLOG 

ENTRY_3( mtriangle_8bit, A, C, N )
SAVE_r13
USE_THRU_v9( VRSAVE_COND )

ADDI( A_counter, N, R_MATRIX_ALIGN_MASK )
VSPLTISW( shift_incr, 8 )
ADDI( altivec_N, N, ALTIVEC_ALIGN_MASK )
VXOR( shift, shift, shift )
RLWINM( A_counter, A_counter, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
RLWINM( altivec_N, altivec_N, 0, 0, (31 - ALTIVEC_ALIGN_LOG) )
ADDI( C_tcols, A_counter, 1 )

LABEL( oloop )
ADDIC_C( count, altivec_N, -64 )
LOAD_C( c0, 0, C )
VSPLTISW( mask, -1 )
LOAD_A( a0, 0, A )
VSRO( mask, mask, shift )
LI( index0, 16 )
VANDC( left, c0, mask )
LI( index1, 32 )
VAND( right, a0, mask )
LI( index2, 48 )
VOR( c0, left, right )
STORE_C( c0, 0, C )
BLE( dosmall )
LI( index3, 64 )

LABEL( iloop )
LOAD_A( a0, A, index0 )
ADDIC_C( count, count, -64 )
LOAD_A( a1, A, index1 )
LOAD_A( a2, A, index2 )
LOAD_A( a3, A, index3 )
STORE_C( a0, C, index0 )
ADDI( index0, index0, 64 )
STORE_C( a1, C, index1 )
ADDI( index1, index1, 64 )
STORE_C( a2, C, index2 )
ADDI( index2, index2, 64 )
STORE_C( a3, C, index3 )
ADDI( index3, index3, 64 )
BGT( iloop )

LABEL( dosmall )
ADDIC_C( count, count, 48 )
BLE( windout )

LABEL( sloop )
ADDIC_C( count, count, -16 )
LOAD_A( a0, A, index0 )
STORE_C( a0, C, index0 )
ADDI( index0, index0, 16 )
BGT( sloop )

LABEL( windout )
DECR_C( N )
VADDUWM( shift, shift, shift_incr )
ADDI( A_counter, A_counter, -1 )
ADDI( A, A, 1 )
ADDI( A_tcols, A_counter, R_MATRIX_ALIGN_MASK )
DECR( altivec_N )
RLWINM( A_tcols, A_tcols, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
ADD( C, C, C_tcols )
ADD( A, A, A_tcols )
BNE( oloop )

FREE_THRU_v9( VRSAVE_COND )
REST_r13
RETURN

FUNC_EPILOG
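At the top of each row, the routine above builds a byte mask with VSPLTISW and VSRO and then merges the first vector of C and A with VANDC, VAND and VOR. One reading of that sequence, offered here as an interpretation rather than a statement from the source, is that the bytes to the left of the diagonal in the first 16-byte vector are preserved from C while the remaining bytes come from A. The scalar sketch below models that blend; the function name blend_row_start and the parameter keep are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one 16-byte vector: keep the first `keep` bytes of c and take the
 * rest from a. This models mask = all-ones shifted right by `keep` octets,
 * followed by c = (c & ~mask) | (a & mask). */
static void blend_row_start(uint8_t c[16], const uint8_t a[16], unsigned keep)
{
    assert(keep < 16);
    for (unsigned i = 0; i < 16; i++) {
        uint8_t mask = (i >= keep) ? 0xff : 0x00;
        c[i] = (uint8_t)((c[i] & (uint8_t)~mask) | (a[i] & mask));
    }
}
```

Under this reading, keep corresponds to the row index modulo 16, since the shift register grows by 8 bits (one octet) per row.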
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#i£ ! defined ( SALPPC_H ) 
#define SALPPCJI 

#if 0 

*********************************************************************
***         MC Standard Algorithms    PPC Version                 ***
*********************************************************************
*
* File Name:   salppc.h
* Description: SAL macro include file
*
* Source files should have extension .mac (for example, vadd.mac)
* and must include this file (salppc.h).
*
* To assemble for PPC ucode, use the following basic
* makefile build rule:
*
*   .SUFFIXES: .mac .c .s .o
*
*   .mac.o:
*       cp $*.mac $*.c
*       ccmc -o $*.s -E $*.c
*       ccmc -c -o $*.o $*.s
*       rm -f $*.s
*       rm -f $*.c
*
* To compile for C, use the following basic makefile build rule:
*
*   .SUFFIXES: .mac .c .o
*
*   .mac.o:
*       cp $*.mac $*.c
*       ccmc -DCOMPILE_C -c -o $*.o $*.c
*       rm -f $*.c
*
* The first 8 function arguments are passed in GPR registers
* r3 - r10. Arguments beyond 8 are passed on the stack and may
* be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
* Additional GPR registers should be assigned in ascending order
* starting from the last function argument. These may be declared
* with the DECLARE_rx[_ry] macros. For example, a function with
* 5 arguments that requires 3 additional GPR registers would
* issue: DECLARE_r8_r10. r0, if required, should be declared
* separately with the DECLARE_r0 macro. GPR registers above r12
* must be saved and restored using the SAVE_r13[_ry] and
* REST_r13[_ry] macros, respectively.
*
* FPR registers should be assigned in ascending order starting
* with f0 [d0]. These may be declared with the DECLARE_f0[_fy]
* or DECLARE_d0[_dy] macros.
* For example, DECLARE_f0_f11. FPR registers above f13 [d13] must
* be saved and restored using the SAVE_f14[_fy] and REST_f14[_fy]
* or SAVE_d14[_dy] and REST_d14[_dy] macros, respectively.
*
* All variables must be assigned a register using the
* pre-processor #define directive. GPR registers are named
* r0 - r31; single precision FPR registers are named f0 - f31.
* Double precision FPR registers are named d0 - d31. Different
* variables may be assigned to the same register as in:
*
*   #define vara f12
*   #define varb f12
*
* Functions must begin with the FUNC_PROLOG macro and end
* with the FUNC_EPILOG macro.
*
* Macros are provided for both Fortran and C entry points. *
* *
* The GET_SALCACHE macro should be used to get the address of *
* the "current" salcache buffer into a GPR register. *
* *
* Avoid terminating macro lines with a semicolon. *
* *
* The following example demonstrates typical usage: *
* *
* #include "salppc.h" *
* *
* /* *
*  * assign variables to registers *
*  */ *
* #define A r3 *
* #define I r4 *
* #define B r5 *
* #define J r6 *
* #define C r7 *
* #define K r8 *
* #define D r9 *
* #define L r10 *
* #define N r12 *
* #define EFLAG r11 *
* #define count r11 *
* #define t0 r13 *
* #define t1 r13 *
* #define t2 r14 *
* #define t3 r14 *
* #define t4 r15 *
* #define t5 r15 *
* #define t6 r16 *
* #define a0 f0 *
* #define a1 f1 *
* #define a2 f2 *
* #define a3 f3 *
* #define b0 f4 *
* #define b1 f5 *
* #define b2 f6 *
* #define b3 f7 *
* #define c0 f8 *
* #define c1 f9 *
* #define c2 f10 *
* #define c3 f11 *
* #define d0 f12 *
* #define d1 f13 *
* #define d2 f14 *
* #define d3 f15 *
* *
* FUNC_PROLOG /* must precede function */ *
* *
* #if !defined( COMPILE_C ) *
* U_ENTRY(foo_) *
* FORTRAN_DREF_4(I, J, K, L) *
* FORTRAN_DREF_ARG8 *
* *
* U_ENTRY(foo) *
* LI(EFLAG, 0) *
* BR(common) *
* *
* U_ENTRY(foo_x_) *
* FORTRAN_DREF_4(I, J, K, L) *
* FORTRAN_DREF_ARG8 *
* FORTRAN_DREF_ARG9 *
* #endif *
* *
* ENTRY_10(foo_x, A, I, B, J, C, K, D, L, N, EFLAG) *
* DECLARE_r13_r16 *
* DECLARE_f0_f15 *
* GET_ARG9( EFLAG ) /* get the 9'th arg (EFLAG) off stack */ *
* *
* LABEL(common) *
* *
* SAVE_CR /* needed if using fields 2,3 or 4 */ *
* SAVE_r13_r16 *
* SAVE_f14_f15 *
* SAVE_LR /* needed if making a function call */ *
* *
* GET_ARG8( N ) /* get the 8'th arg (N) off stack */ *
* *
* /* . . . body of function . . . */ *
* *
* REST_CR *
* REST_r13_r16 *
* REST_f14_f15 *
* REST_LR *
* RETURN *
* *
* FUNC_EPILOG /* must conclude function */ *
* *
* Mercury Computer Systems, Inc. *
* Copyright (c) 1996 All rights reserved *
* *
* Revision Date Engineer; Reason *
* *
* 0.0 960223 jg; Created *
* 0.1 970109 jfk; Added POSTING_BUFFER_COUNT and made *
* TEST_IF_DCBZ macro time "stw" instead *
* of doing the TEST_IF_DCBT macro (lwz) *
* 0.2 970124 jfk; Added SALCACHE_ALLOC_SIZE, *
* ALIGN_SALCACHE, CREATE_SALCACHE_FRAME *
* DESTROY_SALCACHE_FRAME *
* 0.3 970521 jfk; Added SET_DCB[TZ]_COND macros. *
* Made old macros not assemble *
* 0.4 980813 jfk; Changes SALCACHE_ALLOC_SIZE for 750 *



#endif /* header */ 

#include <math.h>

#define uchar unsigned char
#define ulong unsigned long
#define ushort unsigned short

#define CR _cr
#define CTR _ctr
#define VSCR _vscr

/* 

* define a structure to represent a VMX register 
*/ 

typedef union {
    char c[16];
    uchar uc[16];
    short s[8];
    ushort us[8];
    long l[4];
    ulong ul[4];
    float f[4];
} VMX_reg;

#define FUNC_PROLOG
#define FUNC_EPILOG \
}

#define TEXT_SECTION( logb2_align )
#define DATA_SECTION( logb2_align )
#define RODATA_SECTION( logb2_align )
/*
 * macro for C extern declarations
 */

#define EXTERN_DATA( symbol ) \
    extern long symbol;

#define EXTERN_FUNC( func ) \
    extern void func( void );

/* 

* macro for a global declaration 
*/ 

#define GLOBAL ( symbol ) 
/* 

* macro for a local declaration 
*/ 

#define LOCAL ( symbol ) 
/* 

* macros for creating static arrays 
*/ 

#define START_ARRAY( type, name ) \
    type name##[] = {

#define START_C_ARRAY( name ) START_ARRAY( char, name )
#define START_UC_ARRAY( name ) START_ARRAY( uchar, name )
#define START_S_ARRAY( name ) START_ARRAY( short, name )
#define START_US_ARRAY( name ) START_ARRAY( ushort, name )
#define START_L_ARRAY( name ) START_ARRAY( long, name )
#define START_UL_ARRAY( name ) START_ARRAY( ulong, name )
#define START_F_ARRAY( name ) START_ARRAY( float, name )

#define END_ARRAY \
    };

#define DATA( d1 ) \
    d1,

#define DATA2( d1, d2 ) \
    d1, d2,

#define DATA4( d1, d2, d3, d4 ) \
    d1, d2, d3, d4,

#define DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    d1, d2, d3, d4, d5, d6, d7, d8,

#define C_DATA( d1 ) DATA( d1 )
#define UC_DATA( d1 ) DATA( d1 )
#define S_DATA( d1 ) DATA( d1 )
#define US_DATA( d1 ) DATA( d1 )
#define L_DATA( d1 ) DATA( d1 )
#define UL_DATA( d1 ) DATA( d1 )
#define P_DATA( d1 ) DATA( d1 )
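
As a rough illustration of how the array macros above might be used in the C build, the sketch below creates a small static table of floats. It assumes a simplified START_ARRAY that writes `name[]` directly instead of `name##[]`, because pasting an identifier onto `[` is not accepted by every preprocessor. The names `twiddle` and `twiddle_count` are hypothetical and do not appear in the listing.

```c
#include <stddef.h>

/* Simplified copies of the header's macros.
   Assumption: name[] is used in place of name##[]. */
#define START_ARRAY( type, name )   type name[] = {
#define END_ARRAY                   };
#define DATA2( d1, d2 )             d1, d2,
#define START_F_ARRAY( name )       START_ARRAY( float, name )
#define F_DATA2( d1, d2 )           DATA2( d1, d2 )

/* Hypothetical 6-entry table; each F_DATA2 line adds two initializers. */
START_F_ARRAY( twiddle )
    F_DATA2( 1.0f, 0.0f )
    F_DATA2( 0.0f, 1.0f )
    F_DATA2( -1.0f, 0.0f )
END_ARRAY

/* Number of elements the macros produced. */
static size_t twiddle_count( void )
{
    return sizeof twiddle / sizeof twiddle[0];
}
```

Each DATA macro ends its expansion with a comma, so entries can be stacked line by line without separators; the trailing comma before the closing brace is legal in a C initializer.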

#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 ) DATA2( d2, d1 )
#else
#define D_DATA( d1, d2 ) DATA2( d1, d2 )
#endif

#define C_DATA2( d1, d2 ) DATA2( d1, d2 )
#define UC_DATA2( d1, d2 ) DATA2( d1, d2 )
#define S_DATA2( d1, d2 ) DATA2( d1, d2 )
#define US_DATA2( d1, d2 ) DATA2( d1, d2 )
#define L_DATA2( d1, d2 ) DATA2( d1, d2 )
#define UL_DATA2( d1, d2 ) DATA2( d1, d2 )
#define F_DATA2( d1, d2 ) DATA2( d1, d2 )

#define C_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define P_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

/*
 * macros for creating vmx permute masks (128-bits)
 */

#if defined( LITTLE_ENDIAN )

#define L_PERMUTE_MUNGE( l ) ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s ) ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c ) ( (c) ^ 0x1f )
#define L_INDEX_MUNGE( x ) ( (x) ^ 0x3 )
#define S_INDEX_MUNGE( x ) ( (x) ^ 0x7 )
#define C_INDEX_MUNGE( x ) ( (x) ^ 0xf )

#else

#define L_PERMUTE_MUNGE( l ) ( l )
#define S_PERMUTE_MUNGE( s ) ( s )
#define C_PERMUTE_MUNGE( c ) ( c )
#define L_INDEX_MUNGE( x ) ( x )
#define S_INDEX_MUNGE( x ) ( x )
#define C_INDEX_MUNGE( x ) ( x )

#endif
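
The little-endian index munging can be pictured with a short sketch. It assumes the munge operator is a bitwise XOR with the number of elements per 16-byte vector minus one, which reverses the element order inside one vector. The helper `read_long_munged` and the array `demo` are hypothetical names used only for illustration.

```c
/* Little-endian variants, assuming the operator is XOR. */
#define L_INDEX_MUNGE( x )  ( (x) ^ 0x3 )   /* 4 longs per vector  */
#define S_INDEX_MUNGE( x )  ( (x) ^ 0x7 )   /* 8 shorts per vector */
#define C_INDEX_MUNGE( x )  ( (x) ^ 0xf )   /* 16 chars per vector */

/* Hypothetical image of one vector holding four long-sized elements. */
static const int demo[4] = { 10, 20, 30, 40 };

/* Read logical element i through the munged index. */
static int read_long_munged( const int v[4], int i )
{
    return v[L_INDEX_MUNGE( i )];
}
```

With this mapping, logical element 0 is fetched from position 3 and logical element 3 from position 0, so code written against big-endian element numbering can address the same data on a little-endian host.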



#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
    L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
    L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 ),

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
    S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
    S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
    S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
    S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 ),

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
    C_PERMUTE_MUNGE( c1 ), C_PERMUTE_MUNGE( c2 ), \
    C_PERMUTE_MUNGE( c3 ), C_PERMUTE_MUNGE( c4 ), \
    C_PERMUTE_MUNGE( c5 ), C_PERMUTE_MUNGE( c6 ), \
    C_PERMUTE_MUNGE( c7 ), C_PERMUTE_MUNGE( c8 ), \
    C_PERMUTE_MUNGE( c9 ), C_PERMUTE_MUNGE( c10 ), \
    C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
    C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
    C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 ),

/*
 * macro for a microcode entry point (e.g. vaddx, vaddx_)
 * U_ENTRY is a "nop" for C code
 */

#define U_ENTRY( func_name )
/*
 * macros for C function prototypes
 */

#define C_PROTOTYPE_0( func_name ) \
    void func_name( void );

#define C_PROTOTYPE_1( func_name ) \
    void func_name( long );

#define C_PROTOTYPE_2( func_name ) \
    void func_name( long, long );

#define C_PROTOTYPE_3( func_name ) \
    void func_name( long, long, long );

#define C_PROTOTYPE_4( func_name ) \
    void func_name( long, long, long, long );

#define C_PROTOTYPE_5( func_name ) \
    void func_name( long, long, long, long, long );

#define C_PROTOTYPE_6( func_name ) \
    void func_name( long, long, long, long, long, long );

#define C_PROTOTYPE_7( func_name ) \
    void func_name( long, long, long, long, long, long, long );

#define C_PROTOTYPE_8( func_name ) \
    void func_name( long, long, long, long, long, long, long, long );

#define C_PROTOTYPE_9( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long );

#define C_PROTOTYPE_10( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long );

#define C_PROTOTYPE_11( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long );

#define C_PROTOTYPE_12( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long );

#define C_PROTOTYPE_13( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long );

#define C_PROTOTYPE_14( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long, long );

#define C_PROTOTYPE_15( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long, long, long );

#define C_PROTOTYPE_16( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long, long, long, long );

#define AUTO_r3_r31 \
    long r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r4_r31 \
    long r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r5_r31 \
    long r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r6_r31 \
    long r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r7_r31 \
    long r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r8_r31 \
    long r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r9_r31 \
    long r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r10_r31 \
    long r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r11_r31 \
    long r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r12_r31 \
    long r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;

#define AUTO_r13_r31 \
    long r13, r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r14_r31 \
    long r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r15_r31 \
    long r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;
#define AUTO_r16_r31 \
    long r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r17_r31 \
    long r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r18_r31 \
    long r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_r19_r31 \
    long r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_f0_f31 \
    float f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, \
          f15, f16, f17, f18, f19, f20, f21, f22, f23, f24, f25, f26, f27, \
          f28, f29, f30, f31;

#define AUTO_d0_d31 \
    double d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, \
           d15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, \
           d28, d29, d30, d31;

#if defined( BUILD_MAX )
#define AUTO_v0_v31 \
    VMX_reg v0, v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, \
            v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, \
            v28, v29, v30, v31;

#endif
/*
 * For C implementation, create a dummy stack on function entry of size 4096.
 */

#define STACK_SIZE 4096
/*
 * macros for C and Fortran callable entry points
 */

#define ENTRY_0( func_name ) \
C_PROTOTYPE_0( func_name ) \
void func_name( void ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r3_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_1( func_name, arg0 ) \
C_PROTOTYPE_1( func_name ) \
void func_name( long arg0 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r4_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_2( func_name, arg0, arg1 ) \
C_PROTOTYPE_2( func_name ) \
void func_name( long arg0, long arg1 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r5_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
C_PROTOTYPE_3( func_name ) \
void func_name( long arg0, long arg1, long arg2 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r6_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
C_PROTOTYPE_4( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r7_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
C_PROTOTYPE_5( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r8_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
C_PROTOTYPE_6( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r9_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6 ) \
C_PROTOTYPE_7( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r10_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7 ) \
C_PROTOTYPE_8( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r11_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7, arg8 ) \
C_PROTOTYPE_9( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r12_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9 ) \
C_PROTOTYPE_10( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r13_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;
#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10 ) \
C_PROTOTYPE_11( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r14_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11 ) \
C_PROTOTYPE_12( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r15_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12 ) \
C_PROTOTYPE_13( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r16_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13 ) \
C_PROTOTYPE_14( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r17_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14 ) \
C_PROTOTYPE_15( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13, \
                long arg14 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r18_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14, arg15 ) \
C_PROTOTYPE_16( func_name ) \
void func_name( long arg0, long arg1, long arg2, long arg3, long arg4, \
                long arg5, long arg6, long arg7, long arg8, long arg9, \
                long arg10, long arg11, long arg12, long arg13, \
                long arg14, long arg15 ) \
{ \
    long CR[8]; ulong CTR; ulong VSCR; long r0; \
    AUTO_r19_r31 \
    AUTO_f0_f31 \
    AUTO_d0_d31 \
    AUTO_v0_v31 \
    long gpr_save_area[ 19 + 4 ]; \
    long fpr_save_area[ 2*18 + 4 ]; \
    long vr_save_area[ 4*12 + 4 ]; \
    long stack[STACK_SIZE + 4], sp;
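
The way an ENTRY_n macro opens a C function whose parameters and locals are named after registers may be easier to follow in a reduced form. The sketch below is a heavily simplified model under stated assumptions, not the header's definitions: the AUTO macro declares a single register, the function returns `long` so the result can be checked, the body given to `ADD` is assumed (the listing does not show it), and `RETURN_VALUE` and `add_pair` are hypothetical names.

```c
/* Reduced stand-ins for the header's macros (assumptions throughout). */
#define AUTO_r5_r31   long r5;      /* the real macro declares r5..r31 */
#define ENTRY_2( func_name, arg0, arg1 ) \
    long func_name( long arg0, long arg1 ) \
    { \
        AUTO_r5_r31
#define ADD( rD, rA, rB )   (rD) = (rA) + (rB);   /* assumed body       */
#define RETURN_VALUE( rS )  return (rS);          /* hypothetical macro */
#define FUNC_EPILOG         }

/* Variables are bound to "registers" with #define, as in a .mac file. */
#define A   r3
#define B   r4
#define sum r5

/* Expands to: long add_pair( long r3, long r4 ) { long r5; ... } */
ENTRY_2( add_pair, A, B )
    ADD( sum, A, B )
    RETURN_VALUE( sum )
FUNC_EPILOG
```

Because the argument names are themselves register macros, the parameters become `r3`, `r4`, and so on, and the AUTO macro only needs to declare the registers above the last argument. That matches the pattern in the listing where ENTRY_n pairs with AUTO_r(n+3)_r31.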

/*
 * macros to get GPR arguments beyond 8
 */

#define GET_ARG8( rD )
#define GET_ARG9( rD )
#define GET_ARG10( rD )
#define GET_ARG11( rD )
#define GET_ARG12( rD )
#define GET_ARG13( rD )
#define GET_ARG14( rD )
#define GET_ARG15( rD )
#define GET_ARG16( rD )
#define GET_ARG17( rD )

/*
 * macros to set GPR arguments beyond 8
 */

#define SET_ARG8( rD )
#define SET_ARG9( rD )
#define SET_ARG10( rD )
#define SET_ARG11( rD )
#define SET_ARG12( rD )
#define SET_ARG13( rD )
#define SET_ARG14( rD )
#define SET_ARG15( rD )
#define SET_ARG16( rD )
#define SET_ARG17( rD )

/*
 * macro to branch from one entry point to another
 */

#define BR_FUNC( func_name ) \
    func_name( ); \

/*
 * macros to call functions
 */

#define CALL_FUNC( func_name ) \
    func_name( );

/*
 * macros to call functions
 */

#define CALL_0( func_name ) \
    func_name( );

#define CALL_1( func_name, arg0 ) \
    func_name( arg0 );

#define CALL_2( func_name, arg0, arg1 ) \
    func_name( arg0, arg1 );

#define CALL_3( func_name, arg0, arg1, arg2 ) \
    func_name( arg0, arg1, arg2 );

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
    func_name( arg0, arg1, arg2, arg3 );

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    func_name( arg0, arg1, arg2, arg3, arg4 );

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5 );

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6 );

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 );

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                arg8 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8 );

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9 );

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10 );

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11 );

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12 );

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12, arg13 );

#define CALL_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12, arg13, arg14 );

#define CALL_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 ) \
    func_name( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
               arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 );

#if defined( BUILD_MAX )
/*
 * G4 macros to create a dummy jump table.
 * (not supported in C)
 */

#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )

#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )

/*
 * G4 macros to decide whether to enter a VMX loop
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note, a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (not supported in C)
 */

#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag )
#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, n, eflag )
#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, ps, s2, n, eflag )
#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, pc, n, eflag )
#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag )
#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, \
                              eflag )

#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, n, eflag )
#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag )
#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, n, eflag )
#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, n, eflag )
#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, pr5, pi5, s5, n, eflag )
#define BR_IF_VMX_CONV( root_name, min_n_imm, \
                        p1, s1, s2, p3, s3, n, eflag )
#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
                         pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag )

/*
 * G4 macro to get VMX unaligned (FP) count
 * assumes all vectors have the same relative alignment
 * and that the last 2 bits of ptr are 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = ((count) >> 2) & 3; \
    CR[0] = (long)(count); \
}

/*
 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = ((count) >> 1) & 7; \
    CR[0] = (long)(count); \
}

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
{ \
    (count) = -(ptr); \
    (count) = (count) & 15; \
    CR[0] = (long)(count); \
}
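
The arithmetic in the three unaligned-count macros can be checked with a small sketch. It computes how many elements lie between an address and the next 16-byte boundary by negating the address, shifting by the log2 of the element size, and masking to the number of elements per vector minus one. The function names are hypothetical, and unsigned arithmetic on `uintptr_t` is used here instead of the macros' signed `long`; the masked result is the same.

```c
#include <stdint.h>

/* Elements before a 16-byte boundary, for 4-byte (float) elements. */
static unsigned unaligned_count_f( uintptr_t addr )
{
    return (unsigned)( ( ((uintptr_t)0 - addr) >> 2 ) & 3u );
}

/* Same idea for 2-byte (short) elements: 8 per vector. */
static unsigned unaligned_count_s( uintptr_t addr )
{
    return (unsigned)( ( ((uintptr_t)0 - addr) >> 1 ) & 7u );
}

/* Same idea for 1-byte (char) elements: 16 per vector. */
static unsigned unaligned_count_c( uintptr_t addr )
{
    return (unsigned)( ((uintptr_t)0 - addr) & 15u );
}
```

For example, a float pointer at 0x1004 has three elements (0x1004, 0x1008, 0x100C) before the boundary at 0x1010, which is the count a scalar clean-up loop would process before a vector loop takes over.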

/*
 * G4 macro to load and splat an FP scalar independent of alignment
 */

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    (vt).f[0] = (vt).f[1] = (vt).f[2] = (vt).f[3] = *scalarp;

#endif /* end BUILD_MAX */ 
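
The C-side splat operates on the union that models a VMX register. A minimal sketch follows, assuming a reduced union with only byte and float views; `vmx_reg_model` and `splat_and_sum` are hypothetical names used for illustration, not part of the header.

```c
/* Reduced model of the header's VMX_reg union. */
typedef union {
    unsigned char uc[16];
    float         f[4];
} vmx_reg_model;

/* Equivalent of the C SCALAR_SPLAT: copy one scalar into all four lanes. */
static void scalar_splat( vmx_reg_model *vt, const float *scalarp )
{
    vt->f[0] = vt->f[1] = vt->f[2] = vt->f[3] = *scalarp;
}

/* Hypothetical check: splat a value and add the four lanes back up. */
static float splat_and_sum( float s )
{
    vmx_reg_model v;
    scalar_splat( &v, &s );
    return v.f[0] + v.f[1] + v.f[2] + v.f[3];
}
```

Note that the header's union also declares `long l[4]`, which is 16 bytes only where `long` is 32 bits wide, as on the 32-bit PowerPC targets the listing addresses.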

/*
 * cache (DCBT and DCBZ) macros.
 */

#define DCBT_TRUE( cond_bit, scratch ) \
    CR[(cond_bit)] = -1; /* true (<= 0) */

#define DCBZ_TRUE( cond_bit, scratch ) \
    DCBT_TRUE( cond_bit, scratch )

#define DCBT_FALSE( cond_bit, scratch ) \
    CR[(cond_bit)] = 1; /* false (> 0) */

#define DCBZ_FALSE( cond_bit, scratch ) \
    DCBT_FALSE( cond_bit, scratch )

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
    CR[(cond_bit)] = (eflag & (cache_bit));

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    CR[(cond_bit)] = (eflag & (cache_bit));

#define DCBT_IF( cond_bit, rA, rB ) \
    if ( CR[(cond_bit)] <= 0 ) \
    { DCBT( rA, rB ) }

#define DCBZ_IF( cond_bit, rA, rB ) \
    if ( CR[(cond_bit)] <= 0 ) \
    { DCBZ( rA, rB ) }

#define DCBT_IF_CACHABLE( cond_bit, rA, rB ) \
    DCBT_IF( cond_bit, rA, rB )

#define DCBZ_IF_CACHABLE( cond_bit, rA, rB ) \
    DCBZ_IF( cond_bit, rA, rB )

#define BR_IF_CACHABLE( cond_bit, label ) \
    if ( CR[(cond_bit)] <= 0 ) \
        goto label;

#define BR_IF_NOT_CACHABLE( cond_bit, label ) \
    if ( CR[(cond_bit)] > 0 ) \
        goto label;

/*
 * ASIC macros
 */
#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
    *(volatile long *)PREFETCH_CONTROL = (mode);

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
    *(volatile long *)MISCON_B = (mode);

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
{ \
    volatile long i; \
    i = *(volatile long *)MISCON_B; \
    i &= PREFETCH_MASK; \
    i |= USE_PREFETCH_CONTROL; \
    *(volatile long *)PREFETCH_CONTROL = i; \
}

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )
#define LOAD_MISCON_B( mode, scratch1, scratch2 )
#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif
/*
 * instruction macros
 */
#define ADD( rD, rA, rB ) (rD) = (rA) + (rB);
#define ADD_C( rD, rA, rB ) (rD) = (rA) + (rB); CR[0] = (long)(rD);
#define ADDI( rD, rA, SIMM ) (rD) = (rA) + (SIMM);
#define ADDIC_C( rD, rA, SIMM ) (rD) = (rA) + (SIMM); CR[0] = (long)(rD);
#define ADDIS( rD, rA, SIMM ) (rD) = (rA) + ((SIMM) << 16);
#define AND( rA, rS, rB ) (rA) = (rS) & (rB);
#define AND_C( rA, rS, rB ) (rA) = (rS) & (rB); CR[0] = (long)(rA);
#define ANDC( rA, rS, rB ) (rA) = (rS) & ~(rB);
#define ANDC_C( rA, rS, rB ) (rA) = (rS) & ~(rB); CR[0] = (long)(rA);
#define ANDI_C( rA, rS, UIMM ) (rA) = (rS) & (UIMM); CR[0] = (long)(rA);
#define ANDIS_C( rA, rS, UIMM ) (rA) = (rS) & ((UIMM) << 16); \
    CR[0] = (long)(rA);
#define BA( addr ) goto (addr);
#define BCTR (*(void (*)(void))CTR)();
#define BEQ( label ) if ( CR[0] == 0 ) goto label;
#define BEQ_PLUS( label ) BEQ( label )
#define BEQ_MINUS( label ) BEQ( label )
#define BEQ_CR( bit, label ) if ( CR[(bit)] == 0 ) goto label;
#define BEQ_CR_PLUS( bit, label ) BEQ_CR( bit, label )
#define BEQ_CR_MINUS( bit, label ) BEQ_CR( bit, label )
#define BEQLR if ( CR[0] == 0 ) return;
#define BEQLR_PLUS BEQLR
#define BEQLR_MINUS BEQLR
#define BEQLR_CR( bit ) if ( CR[(bit)] == 0 ) return;
#define BEQLR_CR_PLUS( bit ) BEQLR_CR( bit )
#define BEQLR_CR_MINUS( bit ) BEQLR_CR( bit )
#define BGE( label ) if ( CR[0] >= 0 ) goto label;
#define BGE_PLUS( label ) BGE( label )
#define BGE_MINUS( label ) BGE( label )
#define BGE_CR( bit, label ) if ( CR[(bit)] >= 0 ) goto label;
#define BGE_CR_PLUS( bit, label ) BGE_CR( bit, label )
#define BGE_CR_MINUS( bit, label ) BGE_CR( bit, label )
#define BGELR if ( CR[0] >= 0 ) return;
#define BGELR_PLUS BGELR
#define BGELR_MINUS BGELR
#define BGELR_CR( bit ) if ( CR[(bit)] >= 0 ) return;
#define BGELR_CR_PLUS( bit ) BGELR_CR( bit )
#define BGELR_CR_MINUS( bit ) BGELR_CR( bit )
#define BGT( label ) if ( CR[0] > 0 ) goto label;
#define BGT_PLUS( label ) BGT( label )
#define BGT_MINUS( label ) BGT( label )
#define BGT_CR( bit, label ) if ( CR[(bit)] > 0 ) goto label;
#define BGT_CR_PLUS( bit, label ) BGT_CR( bit, label )
#define BGT_CR_MINUS( bit, label ) BGT_CR( bit, label )
#define BGTLR if ( CR[0] > 0 ) return;
#define BGTLR_PLUS BGTLR
#define BGTLR_MINUS BGTLR
#define BGTLR_CR( bit ) if ( CR[(bit)] > 0 ) return;
#define BGTLR_CR_PLUS( bit ) BGTLR_CR( bit )
#define BGTLR_CR_MINUS( bit ) BGTLR_CR( bit )
#define BL( func_name ) func_name();
#define BLE( label ) if ( CR[0] <= 0 ) goto label;
#define BLE_PLUS( label ) BLE( label )
#define BLE_MINUS( label ) BLE( label )
#define BLE_CR( bit, label ) if ( CR[(bit)] <= 0 ) goto label;
#define BLE_CR_PLUS( bit, label ) BLE_CR( bit, label )
#define BLE_CR_MINUS( bit, label ) BLE_CR( bit, label )
#define BLELR if ( CR[0] <= 0 ) return;
#define BLELR_PLUS BLELR
#define BLELR_MINUS BLELR
#define BLELR_CR( bit ) if ( CR[(bit)] <= 0 ) return;
#define BLELR_CR_PLUS( bit ) BLELR_CR( bit )
#define BLELR_CR_MINUS( bit ) BLELR_CR( bit )
#define BLR return;
#define BLT( label ) if ( CR[0] < 0 ) goto label;
#define BLT_PLUS( label ) BLT( label )
#define BLT_MINUS( label ) BLT( label )
#define BLT_CR( bit, label ) if ( CR[(bit)] < 0 ) goto label;
#define BLT_CR_PLUS( bit, label ) BLT_CR( bit, label )
#define BLT_CR_MINUS( bit, label ) BLT_CR( bit, label )
#define BLTLR if ( CR[0] < 0 ) return;
#define BLTLR_PLUS BLTLR
#define BLTLR_MINUS BLTLR
#define BLTLR_CR( bit ) if ( CR[(bit)] < 0 ) return;
#define BLTLR_CR_PLUS( bit ) BLTLR_CR( bit )
#define BLTLR_CR_MINUS( bit ) BLTLR_CR( bit )
#define BNE( label ) if ( CR[0] != 0 ) goto label;
#define BNE_PLUS( label ) BNE( label )
#define BNE_MINUS( label ) BNE( label )
#define BNE_CR( bit, label ) if ( CR[(bit)] != 0 ) goto label;
#define BNE_CR_PLUS( bit, label ) BNE_CR( bit, label )
#define BNE_CR_MINUS( bit, label ) BNE_CR( bit, label )
#define BNELR if ( CR[0] != 0 ) return;
#define BNELR_PLUS BNELR
#define BNELR_MINUS BNELR
#define BNELR_CR( bit ) if ( CR[(bit)] != 0 ) return;
#define BNELR_CR_PLUS( bit ) BNELR_CR( bit )
#define BNELR_CR_MINUS( bit ) BNELR_CR( bit )
#define BR( label ) goto label;
#define CLRLWI( rA, rS, nbits ) (rA) = (rS) & ((1 << (32-nbits)) - 1);
#define CLRLWI_C( rA, rS, nbits ) (rA) = (rS) & ((1 << (32-nbits)) - 1); \
    CR[0] = (long)(rA);
#define CLRRWI( rA, rS, nbits ) (rA) = (rS) & ~((1 << nbits) - 1);
#define CLRRWI_C( rA, rS, nbits ) (rA) = (rS) & ~((1 << nbits) - 1); \
    CR[0] = (long)(rA);
#define CMPLW( rA, rB ) CR[0] = ( ((rA) ^ (rB)) & (1 << 31) ) ? \
    ((rB) - (rA)) : ((rA) - (rB));
#define CMPLW_CR( bit, rA, rB ) CR[(bit)] = ( ((rA) ^ (rB)) & (1 << 31) ) ? \
    ((rB) - (rA)) : ((rA) - (rB));
#define CMPLWI( rA, UIMM ) CR[0] = ( ((rA) ^ (UIMM)) & (1 << 31) ) ? \
    ((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPLWI_CR( bit, rA, UIMM ) CR[(bit)] = ( ((rA) ^ (UIMM)) & (1 << 31) ) ? \
    ((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPW( rA, rB ) CR[0] = (rA) - (rB);
#define CMPW_CR( bit, rA, rB ) CR[(bit)] = (rA) - (rB);
#define CMPWI( rA, SIMM ) CR[0] = (rA) - (SIMM);
#define CMPWI_CR( bit, rA, SIMM ) CR[(bit)] = (rA) - (SIMM);
#define DCBF( rA, rB )
#define DCBI( rA, rB )
#define DCBST( rA, rB )
#define DCBT( rA, rB )
#define DCBTST( rA, rB )
#define DCBZ( rA, rB ) \
    *(long *)(((rA) + (rB)) & ~CACHE_LINE_MASK) = 0; \
    *(long *)((((rA) + (rB)) & ~CACHE_LINE_MASK) + 4) = 0; \
    *(long *)((((rA) + (rB)) & ~CACHE_LINE_MASK) + 8) = 0; \
    *(long *)((((rA) + (rB)) & ~CACHE_LINE_MASK) + 12) = 0; \
    *(long *)((((rA) + (rB)) & ~CACHE_LINE_MASK) + 16) = 0; \
    *(long *)((((rA) + (rB)) & ~CACHE_LINE_MASK) + 20) = 0; \
    *(long *)((((rA) + (rB)) & ~CACHE_LINE_MASK) + 24) = 0; \
    *(long *)((((rA) + (rB)) & ~CACHE_LINE_MASK) + 28) = 0;
#define DECR( rD ) --(rD);
#define DECR_C( rD ) --(rD); CR[0] = (long)(rD);
#define DIVW( rD, rA, rB ) (rD) = (rA) / (rB);
#define DIVW_C( rD, rA, rB ) (rD) = (rA) / (rB); CR[0] = (long)(rD);
#define DIVWU( rD, rA, rB ) (rD) = (ulong)(rA) / (ulong)(rB);
#define DIVWU_C( rD, rA, rB ) (rD) = (ulong)(rA) / (ulong)(rB); \
    CR[0] = (long)(rD);
#define EQV( rA, rS, rB ) (rA) = ~((rS) ^ (rB));
#define EQV_C( rA, rS, rB ) (rA) = ~((rS) ^ (rB)); \
    CR[0] = (long)(rA);
#define FABS( frD, frB ) (frD) = ((frB) >= 0.0) ? (frB) : -(frB);
#define FADD( frD, frA, frB ) (frD) = (frA) + (frB);
#define FADDS( frD, frA, frB ) (frD) = (frA) + (frB);
#define FCMPO( bit, frA, frB ) \
{ \
    if ( (frA) < (frB) ) CR[(bit)] = -1; \
    else if ( (frA) > (frB) ) CR[(bit)] = 1; \
    else CR[(bit)] = 0; \
}
#define FCMPU( bit, frA, frB ) FCMPO( bit, frA, frB )
#define FCTIW( frD, frB )
#define FCTIWZ( frD, frB ) \
{ \
    union { \
        long i[2]; \
        double d; \
    } u; \
    u.i[0] = (long)(frB); \
    u.i[1] = 0; \
    (frD) = u.d; \
}
#define FDIV( frD, frA, frB ) (frD) = (frA) / (frB);
#define FDIVS( frD, frA, frB ) (frD) = (frA) / (frB);
#define FMADD( frD, frA, frC, frB ) (frD) = (frA) * (frC) + (frB);
#define FMADDS( frD, frA, frC, frB ) (frD) = (frA) * (frC) + (frB);
#define FMOV( frD, frB ) (frD) = (frB);
#define FMR( frD, frB ) (frD) = (frB);
#define FMUL( frD, frA, frB ) (frD) = (frA) * (frB);
#define FMULS( frD, frA, frB ) (frD) = (frA) * (frB);
#define FMSUB( frD, frA, frC, frB ) (frD) = (frA) * (frC) - (frB);
#define FMSUBS( frD, frA, frC, frB ) (frD) = (frA) * (frC) - (frB);
#define FNABS( frD, frB ) (frD) = ((frB) >= 0.0) ? -(frB) : (frB);
#define FNEG( frD, frB ) (frD) = -(frB);
#define FNMADD( frD, frA, frC, frB ) (frD) = -((frA) * (frC) + (frB));
#define FNMADDS( frD, frA, frC, frB ) (frD) = -((frA) * (frC) + (frB));
#define FNMSUB( frD, frA, frC, frB ) (frD) = -((frA) * (frC) - (frB));
#define FNMSUBS( frD, frA, frC, frB ) (frD) = -((frA) * (frC) - (frB));
#define FRES( frD, frB )
#define FRSP( frD, frB ) (frD) = (float)(frB);
#define FRSQRTE( frD, frB )
#define FSEL( frD, frA, frC, frB ) (frD) = ((frA) >= 0.0) ? (frC) : (frB);
#define FSUB( frD, frA, frB ) (frD) = (frA) - (frB);
#define FSUBS( frD, frA, frB ) (frD) = (frA) - (frB);
#define GOTO( label ) BR( label )
#define INCR( rD ) ++(rD);
#define INCR_C( rD ) ++(rD); CR[0] = (long)(rD);
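The compare and branch macros above share one convention: a compare stores a signed value in a CR slot, and each branch tests only its sign (negative for "less than", zero for "equal", positive for "greater than"). The sketch below illustrates that convention with standalone functions. All names are hypothetical, and the sign is derived from comparison results rather than from the subtraction used in the listing, so this variant cannot overflow for operands of widely differing magnitude.

```c
#include <stdint.h>

/* Hypothetical stand-in for CMPW: signed compare, result is -1, 0 or +1. */
int32_t cr_cmpw(int32_t a, int32_t b)
{
    return (int32_t)((a > b) - (a < b));
}

/* Hypothetical stand-in for CMPLW: unsigned ("logical") compare. */
int32_t cr_cmplw(uint32_t a, uint32_t b)
{
    return (int32_t)((a > b) - (a < b));
}

/* Predicates mirroring the tests made by BLT, BEQ, BGT, BLE, BGE and BNE. */
int cr_is_lt(int32_t cr) { return cr < 0; }
int cr_is_eq(int32_t cr) { return cr == 0; }
int cr_is_gt(int32_t cr) { return cr > 0; }
int cr_is_le(int32_t cr) { return cr <= 0; }
int cr_is_ge(int32_t cr) { return cr >= 0; }
int cr_is_ne(int32_t cr) { return cr != 0; }
```

Under this convention the same bit pattern can order differently in the two compares: 0xFFFFFFFF is less than 1 as a signed word but greater than 1 as an unsigned word.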
/* 2/23/2001 */
#define LA( rD, symbol, SIMM ) (rD) = (long)&(symbol);
#define LABEL( label ) label:
#define LBZ( rD, rA, d ) (rD) = *(uchar *)((rA) + (d));
#define LBZA( rD, symbol ) (rD) = *(uchar *)&(symbol);
#define LBZU( rD, rA, d ) (rD) = *(uchar *)((rA) += (d));
#define LBZUX( rD, rA, rB ) (rD) = *(uchar *)((rA) += (rB));
#define LBZX( rD, rA, rB ) (rD) = *(uchar *)((rA) + (rB));
#define LFD( frD, rA, d ) (frD) = *(double *)((rA) + (d));
#define LFDU( frD, rA, d ) (frD) = *(double *)((rA) += (d));
#define LFDUX( frD, rA, rB ) (frD) = *(double *)((rA) += (rB));
#define LFDX( frD, rA, rB ) (frD) = *(double *)((rA) + (rB));
#define LFS( frD, rA, d ) (frD) = *(float *)((rA) + (d));
#define LFSA( frD, symbol, rT ) (frD) = *(float *)&(symbol);
#define LFSU( frD, rA, d ) (frD) = *(float *)((rA) += (d));
#define LFSUX( frD, rA, rB ) (frD) = *(float *)((rA) += (rB));
#define LFSX( frD, rA, rB ) (frD) = *(float *)((rA) + (rB));
#define LHA( rD, rA, d ) (rD) = *(short *)((rA) + (d));
#define LHAA( rD, symbol ) (rD) = *(short *)&(symbol);
#define LHAU( rD, rA, d ) (rD) = *(short *)((rA) += (d));
#define LHAUX( rD, rA, rB ) (rD) = *(short *)((rA) += (rB));
#define LHAX( rD, rA, rB ) (rD) = *(short *)((rA) + (rB));
#define LHZ( rD, rA, d ) (rD) = *(ushort *)((rA) + (d));
#define LHZA( rD, symbol ) (rD) = *(ushort *)&(symbol);
#define LHZU( rD, rA, d ) (rD) = *(ushort *)((rA) += (d));
#define LHZUX( rD, rA, rB ) (rD) = *(ushort *)((rA) += (rB));
#define LHZX( rD, rA, rB ) (rD) = *(ushort *)((rA) + (rB));
#define LI( rD, SIMM ) (rD) = (SIMM);
#define LIS( rD, SIMM ) (rD) = ((SIMM) << 16);
#define LOAD_COUNT( rD ) CTR = (rD);
#define LWZ( rD, rA, d ) (rD) = *(long *)((rA) + (d));
#define LWZA( rD, symbol ) (rD) = *(long *)&(symbol);
#define LWZU( rD, rA, d ) (rD) = *(long *)((rA) += (d));
#define LWZUX( rD, rA, rB ) (rD) = *(long *)((rA) += (rB));
#define LWZX( rD, rA, rB ) (rD) = *(long *)((rA) + (rB));
#define MCRF( crfD, crfS )
#define MCRFS( crfD, crfS )
#define MFCR( rD )
#define MFCTR( rD )
#define MFLR( rD )
#define MFSPR( rD, SPR )
#define MOV( rA, rS ) (rA) = (rS);
#define MOV_C( rA, rS ) (rA) = (rS); CR[0] = (long)(rA);
#define MR( rA, rS ) (rA) = (rS);
#define MR_C( rA, rS ) (rA) = (rS); CR[0] = (long)(rA);
#define MTCR( rD )
#define MTCTR( rD )
#define MTFSFI( crfD, IMM )
#define MTLR( rD )
#define MTSPR( SPR, rS )
#define MULLI( rD, rA, SIMM ) (rD) = (rA) * (SIMM);
#define MULLW( rD, rA, rB ) (rD) = (rA) * (rB);
#define MULLW_C( rD, rA, rB ) (rD) = (rA) * (rB); CR[0] = (long)(rD);
#define NAND( rA, rS, rB ) (rA) = ~((rS) & (rB));
#define NAND_C( rA, rS, rB ) (rA) = ~((rS) & (rB)); CR[0] = (long)(rA);
#define NEG( rD, rA ) (rD) = -(rA);
#define NEG_C( rD, rA ) (rD) = -(rA); CR[0] = (long)(rA);
#define NOP
#define NOR( rA, rS, rB ) (rA) = ~((rS) | (rB));
#define NOR_C( rA, rS, rB ) (rA) = ~((rS) | (rB)); CR[0] = (long)(rA);
#define OR( rA, rS, rB ) (rA) = (rS) | (rB);
#define OR_C( rA, rS, rB ) (rA) = (rS) | (rB); CR[0] = (long)(rA);
#define ORC( rA, rS, rB ) (rA) = (rS) | ~(rB);
#define ORC_C( rA, rS, rB ) (rA) = (rS) | ~(rB); CR[0] = (long)(rA);
#define ORI( rA, rS, UIMM ) (rA) = (rS) | (UIMM);
#define ORIS( rA, rS, UIMM ) (rA) = (rS) | ((UIMM) << 16);
#define RETURN BLR
#define RLWIMI( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) &= ~mask; \
    (rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
}
#define RLWIMI_C( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) &= ~mask; \
    (rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
    CR[0] = (long)(rA); \
}
#define RLWINM( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
}
#define RLWINM_C( rA, rS, SH, MB, ME ) \
{ \
    ulong mask; \
    mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
    (rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
    CR[0] = (long)(rA); \
}
#define RLWNM( rA, rS, rB, MB, ME ) RLWINM( rA, rS, (rB) & 0x1f, MB, ME )
#define RLWNM_C( rA, rS, rB, MB, ME ) RLWINM_C( rA, rS, (rB) & 0x1f, MB, ME )
#define EXTLWI( rA, rS, n, b ) RLWINM( rA, rS, (b), 0, (n)-1 )
#define EXTLWI_C( rA, rS, n, b ) RLWINM_C( rA, rS, (b), 0, (n)-1 )
#define EXTRWI( rA, rS, n, b ) RLWINM( rA, rS, (b)+(n), 32-(n), 31 )
#define EXTRWI_C( rA, rS, n, b ) RLWINM( rA, rS, (b)+(n), 32-(n), 31 )
#define INSLWI( rA, rS, n, b ) RLWIMI( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSLWI_C( rA, rS, n, b ) RLWIMI_C( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSRWI( rA, rS, n, b ) RLWIMI( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define INSRWI_C( rA, rS, n, b ) RLWIMI_C( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define ROTLW( rA, rS, rB ) RLWNM( rA, rS, rB, 0, 31 )
#define ROTLW_C( rA, rS, rB ) RLWNM_C( rA, rS, rB, 0, 31 )
#define ROTLWI( rA, rS, n ) RLWINM( rA, rS, (n), 0, 31 )
#define ROTLWI_C( rA, rS, n ) RLWINM_C( rA, rS, (n), 0, 31 )
#define ROTRWI( rA, rS, n ) RLWINM( rA, rS, 32-(n), 0, 31 )
#define ROTRWI_C( rA, rS, n ) RLWINM( rA, rS, 32-(n), 0, 31 )
#define SLW( rA, rS, rB ) (rA) = (rS) << (rB);
#define SLW_C( rA, rS, rB ) (rA) = (rS) << (rB); CR[0] = (long)(rA);
#define SLWI( rA, rS, SH ) (rA) = (rS) << (SH);
#define SLWI_C( rA, rS, SH ) (rA) = (rS) << (SH); CR[0] = (long)(rA);
#define SRAW( rA, rS, rB ) (rA) = (long)(rS) >> (rB);
#define SRAW_C( rA, rS, rB ) (rA) = (long)(rS) >> (rB); CR[0] = (long)(rA);
#define SRAWI( rA, rS, SH ) (rA) = (long)(rS) >> (SH);
#define SRAWI_C( rA, rS, SH ) (rA) = (long)(rS) >> (SH); CR[0] = (long)(rA);
#define SRW( rA, rS, rB ) (rA) = (ulong)(rS) >> (rB);
#define SRW_C( rA, rS, rB ) (rA) = (ulong)(rS) >> (rB); CR[0] = (long)(rA);
#define SRWI( rA, rS, SH ) (rA) = (ulong)(rS) >> (SH);
#define SRWI_C( rA, rS, SH ) (rA) = (ulong)(rS) >> (SH); CR[0] = (long)(rA);
#define STB( rS, rA, d ) *(char *)((rA) + (d)) = (rS);
#define STBU( rS, rA, d ) *(char *)((rA) += (d)) = (rS);
#define STBUX( rS, rA, rB ) *(char *)((rA) += (rB)) = (rS);
#define STBX( rS, rA, rB ) *(char *)((rA) + (rB)) = (rS);
#define STFD( frD, rA, d ) *(double *)((rA) + (d)) = (frD);
#define STFDU( frD, rA, d ) *(double *)((rA) += (d)) = (frD);
#define STFDUX( frD, rA, rB ) *(double *)((rA) += (rB)) = (frD);
#define STFDX( frD, rA, rB ) *(double *)((rA) + (rB)) = (frD);
#define STFS( frD, rA, d ) *(float *)((rA) + (d)) = (frD);
#define STFSU( frD, rA, d ) *(float *)((rA) += (d)) = (frD);
#define STFSUX( frD, rA, rB ) *(float *)((rA) += (rB)) = (frD);
#define STFSX( frD, rA, rB ) *(float *)((rA) + (rB)) = (frD);
#define STH( rS, rA, d ) *(short *)((rA) + (d)) = (rS);
#define STHU( rS, rA, d ) *(short *)((rA) += (d)) = (rS);
#define STHUX( rS, rA, rB ) *(short *)((rA) += (rB)) = (rS);
#define STHX( rS, rA, rB ) *(short *)((rA) + (rB)) = (rS);
#define STW( rS, rA, d ) *(long *)((rA) + (d)) = (rS);
#define STWU( rS, rA, d ) *(long *)((rA) += (d)) = (rS);
#define STWUX( rS, rA, rB ) *(long *)((rA) += (rB)) = (rS);
#define STWX( rS, rA, rB ) *(long *)((rA) + (rB)) = (rS);
#define SUB( rD, rA, rB ) (rD) = (rA) - (rB);
#define SUB_C( rD, rA, rB ) (rD) = (rA) - (rB); CR[0] = (long)(rD);
#define SUBFIC( rD, rA, SIMM ) (rD) = (SIMM) - (rA);
#define SUBI( rD, rA, SIMM ) (rD) = (rA) - (SIMM);
#define SUBIC_C( rD, rA, SIMM ) (rD) = (rA) - (SIMM); CR[0] = (long)(rD);
#define SUBIS( rD, rA, SIMM ) (rD) = (rA) - ((SIMM) << 16);
#define TEST_COUNT( label ) if ( --CTR ) goto label;
#define XOR( rA, rS, rB ) (rA) = (rS) ^ (rB);
#define XOR_C( rA, rS, rB ) (rA) = (rS) ^ (rB); CR[0] = (long)(rA);
#define XORI( rA, rS, UIMM ) (rA) = (rS) ^ (UIMM);
#define XORIS( rA, rS, UIMM ) (rA) = (rS) ^ ((UIMM) << 16);

#if defined( BUILD_MAX )
/*
 * VMX instructions
 */
#define BR_VMX_ALL_TRUE( label ) if ( CR[6] & 0x8 ) goto label;
#define BR_VMX_ALL_FALSE( label ) if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_NONE_TRUE( label ) if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_SOME_FALSE( label ) if ( !(CR[6] & 0x8) ) goto label;
#define BR_VMX_SOME_TRUE( label ) if ( !(CR[6] & 0x2) ) goto label;
#define DSS( STRM )
#define DSSALL
#define DST( rA, rB, STRM )
#define DSTT( rA, rB, STRM )
#define DSTST( rA, rB, STRM )
#define DSTSTT( rA, rB, STRM )

#if defined( COMPILE_NON_ALIGNED )
#define VMX_ADDR_MASK 0
#else
#define VMX_ADDR_MASK 15
#endif

#if defined( COMPILE_LVX_CHARS )

#define LVX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[C_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
    (vT).c[C_INDEX_MUNGE( i + 2 )] = addr[2]; \
    (vT).c[C_INDEX_MUNGE( i + 3 )] = addr[3]; \
}

#elif defined( COMPILE_LVX_SHORTS )

#define LVX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[S_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
    (vT).s[S_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#else

#define LVX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[L_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    (vT).l[L_INDEX_MUNGE( i )] = addr[0]; \
}

#endif

#if defined( COMPILE_STVX_CHARS )

#define STVX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        addr[i] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
    addr[2] = (vS).c[C_INDEX_MUNGE( i + 2 )]; \
    addr[3] = (vS).c[C_INDEX_MUNGE( i + 3 )]; \
}

#elif defined( COMPILE_STVX_SHORTS )

#define STVX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        addr[i] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
    addr[1] = (vS).s[S_INDEX_MUNGE( i + 1 )]; \
}

#else

#define STVX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        addr[i] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    addr[0] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#endif

#define LVSL_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = ((ulong)(rA) + (ulong)(rB)) & VMX_ADDR_MASK; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}

#define LVSR_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = 16 - (((ulong)(rA) + (ulong)(rB)) & VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}

#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB ) LVSR_BE( vT, rA, rB );
#define LVSR( vT, rA, rB ) LVSL_BE( vT, rA, rB );
#else
#define LVSL( vT, rA, rB ) LVSL_BE( vT, rA, rB );
#define LVSR( vT, rA, rB ) LVSR_BE( vT, rA, rB );
#endif

#define LVXL( vT, rA, rB ) LVX( vT, rA, rB )
#define STVXL( vS, rA, rB ) STVX( vS, rA, rB )
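The LVSL_BE and LVSR_BE macros above build a 16-byte permute-control vector from the low four bits of an effective address: LVSL starts the byte ramp at the misalignment, and LVSR starts it at 16 minus the misalignment. The sketch below restates that computation as functions on a plain byte array. The function names are hypothetical, and a fixed mask of 15 is assumed (the listing's `VMX_ADDR_MASK` is 0 when `COMPILE_NON_ALIGNED` is defined).

```c
#include <stdint.h>

/* Hypothetical LVSL equivalent: out[i] = (addr & 15) + i. */
void lvsl_control(uint8_t out[16], uintptr_t addr)
{
    unsigned j = (unsigned)(addr & 15u);
    for (unsigned i = 0; i < 16; i++)
        out[i] = (uint8_t)(j + i);
}

/* Hypothetical LVSR equivalent: out[i] = 16 - (addr & 15) + i. */
void lvsr_control(uint8_t out[16], uintptr_t addr)
{
    unsigned j = 16u - (unsigned)(addr & 15u);
    for (unsigned i = 0; i < 16; i++)
        out[i] = (uint8_t)(j + i);
}
```

The resulting indices range over 0 to 31, which is the index space a permute instruction uses to select bytes from the concatenation of two 16-byte source vectors.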
#define VADDFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a + b; \
        (vT).f[i] = c; \
    } \
}

#define VADDSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] + (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}

#define VADDSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] + (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}

#define VADDSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] + (vB).l[i]; \
        if ( ((vA).l[i] > 0) && ((vB).l[i] > 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] < 0) && (itemp > 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}

#define VADDUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] + (vB).uc[i]; \
}

#define VADDUBS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (ulong)(vA).uc[i] + (ulong)(vB).uc[i]; \
        if ( itemp > 255 ) (vT).uc[i] = 255; \
        else (vT).uc[i] = (uchar)itemp; \
    } \
}

#define VADDUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] + (vB).us[i]; \
}

/* saturated results are stored in the halfword (us) elements */
#define VADDUHS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (ulong)(vA).us[i] + (ulong)(vB).us[i]; \
        if ( itemp > 65535 ) (vT).us[i] = 65535; \
        else (vT).us[i] = (ushort)itemp; \
    } \
}

#define VADDUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] + (vB).ul[i]; \
}

#define VADDUWS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).ul[i] + (vB).ul[i]; \
        if ( itemp < (vA).ul[i] ) (vT).ul[i] = (ulong)0xffffffff; \
        else (vT).ul[i] = itemp; \
    } \
}
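The saturating-add macros above (the names ending in S) clamp each element's sum to the range of its type, while the modulo forms (ending in M) let it wrap. A minimal per-element sketch of the saturating case follows. The function names are hypothetical, and the signed 32-bit variant widens to 64 bits before clamping, which is an alternative to the sign-test form used in the listing.

```c
#include <stdint.h>

/* Signed byte: compute in int, then clamp to [-128, 127]. */
int8_t sat_add_s8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;
    if (sum < INT8_MIN) return INT8_MIN;
    if (sum > INT8_MAX) return INT8_MAX;
    return (int8_t)sum;
}

/* Unsigned byte: clamp to 255. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
}

/* Signed word: widen to 64 bits so the sum itself cannot overflow. */
int32_t sat_add_s32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum < INT32_MIN) return INT32_MIN;
    if (sum > INT32_MAX) return INT32_MAX;
    return (int32_t)sum;
}

/* Unsigned word: a wrapped sum is smaller than either operand. */
uint32_t sat_add_u32(uint32_t a, uint32_t b)
{
    uint32_t sum = a + b;
    return sum < a ? UINT32_MAX : sum;
}
```

Applying one of these functions to each lane of a vector gives the behavior of the corresponding vector macro.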
#define VAND( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & (vB).ul[i]; \
}

#define VANDC( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & ~(vB).ul[i]; \
}

#define VCMPEQFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPEQFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).uc[i] == (vB).uc[i] ) ? 0xff : 0; \
}

#define VCMPEQUB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).uc[i] == (vB).uc[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).us[i] == (vB).us[i] ) ? 0xffff : 0; \
}

#define VCMPEQUH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).us[i] == (vB).us[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).ul[i] == (vB).ul[i] ) ? 0xffffffff : 0; \
}

#define VCMPEQUW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).ul[i] == (vB).ul[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}
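The `_C` compare macros above produce two things: a per-element mask (all ones where the comparison holds, zero elsewhere) and a summary in CR[6], which is 0x8 when every element compared true, 0x2 when none did, and 0 otherwise. The sketch below shows the word-equality case as a function that returns the summary value instead of writing a CR array. The function name and signature are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical VCMPEQUW_C equivalent for four 32-bit lanes.
 * Fills out[] with the per-lane mask and returns the CR6-style summary:
 * 0x8 if all lanes are equal, 0x2 if no lane is equal, 0 for a mix. */
unsigned vcmpequw_c(uint32_t out[4], const uint32_t a[4], const uint32_t b[4])
{
    uint32_t all = 0xFFFFFFFFu; /* stays all ones only if every lane matched */
    uint32_t any = 0;           /* becomes nonzero if any lane matched */

    for (int i = 0; i < 4; i++) {
        out[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
        all &= out[i];
        any |= out[i];
    }
    if (all) return 0x8u;
    if (!any) return 0x2u;
    return 0u;
}
```

These are the same two bits that the BR_VMX_ALL_TRUE and BR_VMX_NONE_TRUE branch macros in the listing test, so a caller can branch on the summary without inspecting the mask.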

#define VCMPGEFP ( vT, vA, vB ) \ 
( \ 

ulong i; \ 

for ( i = 0; i < 4; i++ ) \ 

(vT).ul[i] = ( <vA).f[i] >= (vB).f[iJ ) ? Oxffffffff : 0; \ 

#define VCMPGEFP C( vT, vA, vB ) \ 
{ \ 

ulong i; \ 
ulong t, f; \ 

t = Oxffffffff; \ 
f = 0; \ 

for ( i = 0; i < 4; i++ ) { \ 

(vT).ulti] = ( (vA).f[i] >= (vB).f[i] ) ? Oxffffffff : 0; \ 
t (vT) .ulfi] ; \ 
f |= (vT) .ul[i] ; \ 

if ( t ) CR[63 = 0x8; \ 

else if ( !f ) CR[6] = 0x2; \ 

else CR[6} = 0; \ 

#define VCMPGTFP( vT, vA, vB ) \ 
[\ 

ulong i; \ 

for ( i = 0; i < 4; i++ ) \ 

(vT).ul[ij = ( <vA).f[i] > (vB).f[i3 ) ? Oxffffffff : 0; \ 

#define VCMPGTFP C( vT, vA, vB ) \ 
{ \ 

ulong i; \ 
ulong t, f; \ 
t = Oxffffffff; \ 
f = 0; \ 

for ( i = 0; i < 4; i++ ) { \ 

(vT).ul[i] = ( (vA).f[i] > (vB).f[i] ) ? Oxffffffff ; 0; \ 
t &= <vT) .ul[i] ; \ 
f |= (vT) .ul[i] ; \ 
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if ( t ) CR[6] = 0x8; \ 

else if ( if ) CR[6] = 0x2; \ 

else CR[6] = 0; \ 

#define VCMPGTSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
}

#define VCMPGTSB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
}

#define VCMPGTSH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTSW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
}

#define VCMPGTUB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
}

#define VCMPGTUH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTUW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCFSX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 + ((UIMM) & 0x1f)) << 23; /* bit pattern of 2^UIMM */ \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).l[i]) / fj; /* scale by 1 / 2^UIMM */ \
}

#define VCFUX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 + ((UIMM) & 0x1f)) << 23; /* bit pattern of 2^UIMM */ \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).ul[i]) / fj; /* scale by 1 / 2^UIMM */ \
}

#define VCTSXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i; \
    long l; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; /* 2^31 */ \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; /* 2^UIMM */ \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= -max ) l = 0x80000000; \
        else if ( g >= max ) l = 0x7fffffff; \
        else l = (long)g; /* truncate the scaled value toward zero */ \
        (vT).l[i] = l; \
    } \
}

#define VCTUXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i, ul; \
    i = (127 + 32) << 23; \
    max = *(float *)&i; /* 2^32 */ \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; /* 2^UIMM */ \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= 0 ) ul = 0; \
        else if ( g >= max ) ul = 0xffffffff; \
        else ul = (ulong)g; /* truncate the scaled value toward zero */ \
        (vT).ul[i] = ul; \
    } \
}
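
/*
 * A minimal sketch of the exponent-bit trick used by the conversion macros:
 * writing (127 + n) into the exponent field of an IEEE-754 single yields
 * 2^n. The names pow2_float and float_to_fixed_sat are hypothetical, and
 * memcpy replaces the pointer cast so that the sketch does not depend on
 * ulong being 32 bits wide. NaN inputs are not handled here.
 */

```c
#include <stdint.h>
#include <string.h>

/* Build 2^n (0 <= n <= 31) directly from its IEEE-754 bit pattern. */
static float pow2_float(unsigned n)
{
    uint32_t bits = (127u + (n & 0x1fu)) << 23;  /* biased exponent, zero mantissa */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Scale x by 2^n, truncate toward zero, and saturate to the int32_t range. */
static int32_t float_to_fixed_sat(float x, unsigned n)
{
    float g = x * pow2_float(n);
    if (g <= -2147483648.0f) return INT32_MIN;
    if (g >=  2147483648.0f) return INT32_MAX;
    return (int32_t)g;
}
```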

#define VEXPTEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = exp( 0.693147180559945 * (vB).f[i] ); /* 2^x = e^(x ln 2) */ \
}

#define VLOGEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.442695040888963 * log( (vB).f[i] ); /* log2 x = ln x / ln 2 */ \
}

#define VMADDFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b + d; \
        (vT).f[i] = d; \
    } \
}

#define VMAXFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] >= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMAXSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] >= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

#define VMAXSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = ((vA).s[i] >= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMAXSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = ((vA).l[i] >= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMAXUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] >= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMAXUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ((vA).us[i] >= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMAXUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ((vA).ul[i] >= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}

#define VMHADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}

#define VMHRADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a += 0x00004000; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}
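
/*
 * A minimal scalar sketch of one lane of the rounded multiply-high-add with
 * saturation shown above. The name mhradds_q15 is hypothetical. The sketch
 * assumes that >> on a negative int32_t is an arithmetic shift, which is
 * implementation-defined in C but holds on common compilers.
 */

```c
#include <stdint.h>

/* One lane: ((a * b + 0x4000) >> 15) + c, saturated to the int16_t range. */
static int16_t mhradds_q15(int16_t a, int16_t b, int16_t c)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* at most 2^30, fits in int32_t */
    p += 0x4000;                          /* rounding constant: half of 2^15 */
    p >>= 15;                             /* keep the high part of the product */
    p += c;
    if (p > 32767) p = 32767;
    else if (p < -32768) p = -32768;
    return (int16_t)p;
}
```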

#define VMINFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] <= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMINSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] <= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

#define VMINSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) /* 8 halfwords per vector */ \
        (vT).s[i] = ((vA).s[i] <= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMINSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) /* 4 words per vector */ \
        (vT).l[i] = ((vA).l[i] <= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMINUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] <= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMINUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) /* 8 halfwords per vector */ \
        (vT).us[i] = ((vA).us[i] <= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMINUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) /* 4 words per vector */ \
        (vT).ul[i] = ((vA).ul[i] <= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}

#define VMLADDUHM( vD, vA, vB, vC ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).us[i]; \
        b = (ulong)(vB).us[i]; \
        c = (ulong)(vC).us[i]; \
        c += (a * b); \
        (vD).us[i] = (ushort)c; \
    } \
}

#define VMR( vD, vS ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vD).ul[i] = (vS).ul[i]; \
}

#define VMRGHB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[i]; \
        v.uc[(j+1)] = (vB).uc[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGHH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[i]; \
        v.us[(j+1)] = (vB).us[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGHW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[i]; \
        v.ul[(j+1)] = (vB).ul[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[(8+i)]; \
        v.uc[(j+1)] = (vB).uc[(8+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[(4+i)]; \
        v.us[(j+1)] = (vB).us[(4+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[(2+i)]; \
        v.ul[(j+1)] = (vB).ul[(2+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )
#define VMRGHB( vT, vA, vB ) VMRGLB_BE( vT, vB, vA )
#define VMRGHH( vT, vA, vB ) VMRGLH_BE( vT, vB, vA )
#define VMRGHW( vT, vA, vB ) VMRGLW_BE( vT, vB, vA )
#define VMRGLB( vT, vA, vB ) VMRGHB_BE( vT, vB, vA )
#define VMRGLH( vT, vA, vB ) VMRGHH_BE( vT, vB, vA )
#define VMRGLW( vT, vA, vB ) VMRGHW_BE( vT, vB, vA )
#else
#define VMRGHB( vT, vA, vB ) VMRGHB_BE( vT, vA, vB )
#define VMRGHH( vT, vA, vB ) VMRGHH_BE( vT, vA, vB )
#define VMRGHW( vT, vA, vB ) VMRGHW_BE( vT, vA, vB )
#define VMRGLB( vT, vA, vB ) VMRGLB_BE( vT, vA, vB )
#define VMRGLH( vT, vA, vB ) VMRGLH_BE( vT, vA, vB )
#define VMRGLW( vT, vA, vB ) VMRGLW_BE( vT, vA, vB )
#endif















#define VMSUMMBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, c; \
    ulong b; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (long)(vA).c[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}

#define VMSUMSHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; /* two halfwords per word */ \
            b = (long)(vB).s[2*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}

#define VMSUMSHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; /* two halfwords per word */ \
            b = (long)(vB).s[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 2147483647.0 ) c = 2147483647.0; \
        else if ( c <= -2147483648.0 ) c = -2147483648.0; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMSUMUBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (ulong)(vA).uc[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; /* two halfwords per word */ \
            b = (ulong)(vB).us[2*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; /* two halfwords per word */ \
            b = (ulong)(vB).us[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 4294967295.0 ) c = 4294967295.0; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULESB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i]; \
        b = (long)(vB).c[2*i]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULESH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i]; \
        b = (long)(vB).s[2*i]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULEUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i]; \
        b = (ulong)(vB).uc[2*i]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULEUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i]; \
        b = (ulong)(vB).us[2*i]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULOSB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i+1]; \
        b = (long)(vB).c[2*i+1]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULOSH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i+1]; \
        b = (long)(vB).s[2*i+1]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULOUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i+1]; \
        b = (ulong)(vB).uc[2*i+1]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULOUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i+1]; \
        b = (ulong)(vB).us[2*i+1]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VNMSUBFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b - d; \
        (vT).f[i] = d; \
    } \
}

#define VNOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ~((vA).ul[i] | (vB).ul[i]); \
}

#define VNOT( vT, vA ) VNOR( vT, vA, vA )

#define VOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] | (vB).ul[i]; \
}

#define VPERM_BE( vT, vA, vB, vC ) \
{ \
    VMX_reg v; \
    ulong field, i; \
    for ( i = 0; i < 16; i++ ) { \
        field = (vC).uc[i]; \
        v.uc[i] = ( field < 16 ) ? (vA).uc[field] : (vB).uc[field - 16]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUHUM_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        v.uc[i] = (vA).uc[(j)]; \
        v.uc[i+8] = (vB).uc[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUHUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        v.uc[i] = (vA).uc[(j^1)] ? (uchar)255 : (vA).uc[(j)]; \
        v.uc[i+8] = (vB).uc[(j^1)] ? (uchar)255 : (vB).uc[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSHUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).s[i] <= 0 ) v.uc[i] = 0; \
        else if ( (vA).s[i] >= 255 ) v.uc[i] = 255; \
        else v.uc[i] = (vA).uc[j]; \
        if ( (vB).s[i] <= 0 ) v.uc[i+8] = 0; \
        else if ( (vB).s[i] >= 255 ) v.uc[i+8] = 255; \
        else v.uc[i+8] = (vB).uc[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSHSS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).s[i] <= -128 ) v.c[i] = -128; \
        else if ( (vA).s[i] >= 127 ) v.c[i] = 127; \
        else v.c[i] = (vA).c[j]; \
        if ( (vB).s[i] <= -128 ) v.c[i+8] = -128; \
        else if ( (vB).s[i] >= 127 ) v.c[i+8] = 127; \
        else v.c[i+8] = (vB).c[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUWUM_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        v.us[i] = (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        v.us[i] = (vA).us[(j^1)] ? (ushort)65535 : (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j^1)] ? (ushort)65535 : (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).l[i] <= 0 ) v.us[i] = 0; \
        else if ( (vA).l[i] >= 65535 ) v.us[i] = 65535; \
        else v.us[i] = (vA).us[j]; \
        if ( (vB).l[i] <= 0 ) v.us[i+4] = 0; \
        else if ( (vB).l[i] >= 65535 ) v.us[i+4] = 65535; \
        else v.us[i+4] = (vB).us[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSWSS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { /* 4 words per source vector */ \
        if ( (vA).l[i] <= -32768 ) v.s[i] = -32768; \
        else if ( (vA).l[i] >= 32767 ) v.s[i] = 32767; \
        else v.s[i] = (vA).s[j]; \
        if ( (vB).l[i] <= -32768 ) v.s[i+4] = -32768; \
        else if ( (vB).l[i] >= 32767 ) v.s[i+4] = 32767; \
        else v.s[i+4] = (vB).s[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )  VPERM_BE( vT, vB, vA, vC );
#define VPKUHUM( vT, vA, vB )    VPKUHUM_BE( vT, vB, vA, 0 )
#define VPKUHUS( vT, vA, vB )    VPKUHUS_BE( vT, vB, vA, 0 )
#define VPKSHUS( vT, vA, vB )    VPKSHUS_BE( vT, vB, vA, 0 )
#define VPKSHSS( vT, vA, vB )    VPKSHSS_BE( vT, vB, vA, 0 )
#define VPKUWUM( vT, vA, vB )    VPKUWUM_BE( vT, vB, vA, 0 )
#define VPKUWUS( vT, vA, vB )    VPKUWUS_BE( vT, vB, vA, 0 )
#define VPKSWUS( vT, vA, vB )    VPKSWUS_BE( vT, vB, vA, 0 )
#define VPKSWSS( vT, vA, vB )    VPKSWSS_BE( vT, vB, vA, 0 )
#else
#define VPERM( vT, vA, vB, vC )  VPERM_BE( vT, vA, vB, vC );
#define VPKUHUM( vT, vA, vB )    VPKUHUM_BE( vT, vA, vB, 1 )
#define VPKUHUS( vT, vA, vB )    VPKUHUS_BE( vT, vA, vB, 1 )
#define VPKSHUS( vT, vA, vB )    VPKSHUS_BE( vT, vA, vB, 1 )
#define VPKSHSS( vT, vA, vB )    VPKSHSS_BE( vT, vA, vB, 1 )
#define VPKUWUM( vT, vA, vB )    VPKUWUM_BE( vT, vA, vB, 1 )
#define VPKUWUS( vT, vA, vB )    VPKUWUS_BE( vT, vA, vB, 1 )
#define VPKSWUS( vT, vA, vB )    VPKSWUS_BE( vT, vA, vB, 1 )
#define VPKSWSS( vT, vA, vB )    VPKSWSS_BE( vT, vA, vB, 1 )
#endif



#define VREFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / (vB).f[i]; \
}

#define VRFIM( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r > f ) --r; \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}

#define VRFIN( vT, vB ) \
{ \
    float f, r, s; \
    ulong i; \
    long lr; \
    for ( i = 0; i < 4; i++ ) { \
        s = f = (vB).f[i]; \
        if ( f < 0.0 ) f = -f; \
        r = f + 0.5; \
        if ( r != f ) { \
            lr = (long)r; \
            f = (float)lr; \
            if ( f == r ) f = (float)(lr & ~1); \
        } \
        if ( s < 0.0 ) f = -f; \
        (vT).f[i] = f; \
    } \
}

#define VRFIP( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r < f ) ++r; \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}

#define VRFIZ( vT, vB ) \
{ \
    float f, max; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) \
            f = (float)((long)f); \
        (vT).f[i] = f; \
    } \
}

#define VRLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = ((vA).uc[i] << sh) | ((vA).uc[i] >> (8-sh)); \
    } \
}

#define VRLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = ((vA).us[i] << sh) | ((vA).us[i] >> (16-sh)); \
    } \
}

#define VRSQRTEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / sqrt( (vB).f[i] ); \
}

#define VRLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = ((vA).ul[i] << sh) | ((vA).ul[i] >> (32-sh)); \
    } \
}
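
/*
 * A minimal sketch of a 32-bit rotate left that stays well defined when the
 * count is zero. In C, shifting a 32-bit value right by 32 is undefined, so
 * the complementary count is masked here. The name rotl32 and the use of
 * uint32_t in place of the listing's ulong are assumptions made for this
 * illustration.
 */

```c
#include <stdint.h>

/* Rotate x left by sh bits; only the low five bits of sh are used. */
static uint32_t rotl32(uint32_t x, unsigned sh)
{
    sh &= 31u;
    /* (32 - sh) & 31 maps a zero count to a zero right shift */
    return (x << sh) | (x >> ((32u - sh) & 31u));
}
```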

#define VSEL( vT, vA, vB, vC ) \
{ \
    ulong atemp, btemp, i; \
    for ( i = 0; i < 4; i++ ) { \
        atemp = (vA).ul[i] & ~(vC).ul[i]; /* bits taken from vA where vC is 0 */ \
        btemp = (vB).ul[i] & (vC).ul[i];  /* bits taken from vB where vC is 1 */ \
        (vT).ul[i] = atemp | btemp; \
    } \
}
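
/*
 * A minimal scalar sketch of the bitwise select performed on each word:
 * every result bit comes from a where the mask bit is 0 and from b where
 * the mask bit is 1. The name bitsel32 is hypothetical.
 */

```c
#include <stdint.h>

/* Per-bit multiplexer: mask bit 0 selects a, mask bit 1 selects b. */
static uint32_t bitsel32(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & ~mask) | (b & mask);
}
```

/*
 * Combined with the all-ones or all-zero masks produced by the compare
 * macros, this gives a branch-free per-element choice between two vectors.
 */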

#define VSL( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[0] = ((vA).ul[0] << sh) | ((vA).ul[1] >> (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] << sh) | ((vA).ul[2] >> (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] << sh) | ((vA).ul[3] >> (32-sh)); \
    (vT).ul[3] = (vA).ul[3] << sh; \
}

#define VSLDOI( vT, vA, vB, UIMM ) \
{ \
    VMX_reg v; \
    ulong i, j, sh; \
    sh = (UIMM) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        v.uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        v.uc[j] = (vB).uc[j-i]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VSLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] << sh; \
    } \
}

#define VSLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] << sh; \
    } \
}

#define VSLO( vT, vA, vB ) \
{ \
    ulong i, j, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        (vT).uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        (vT).uc[j] = 0; \
}

#define VSLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] << sh; \
    } \
}



#define VSR( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[3] = ((vA).ul[3] >> sh) | ((vA).ul[2] << (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] >> sh) | ((vA).ul[1] << (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] >> sh) | ((vA).ul[0] << (32-sh)); \
    (vT).ul[0] = (vA).ul[0] >> sh; \
}

#define VSRAB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).c[i] = (vA).c[i] >> sh; \
    } \
}

#define VSRAH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).s[i] = (vA).s[i] >> sh; \
    } \
}

#define VSRAW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).l[i] = (vA).l[i] >> sh; \
    } \
}

#define VSRB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] >> sh; \
    } \
}

#define VSRH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] >> sh; \
    } \
}

#define VSRO( vT, vA, vB ) \
{ \
    long i, j, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 15; i >= sh; i-- ) \
        (vT).uc[i] = (vA).uc[i-sh]; \
    for ( j = i; j >= 0; j-- ) \
        (vT).uc[j] = 0; \
}

#define VSRW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] >> sh; \
    } \
}

#define VSPLTB( vT, vB, UIMM ) \
{ \
    uchar c; \
    ulong i; \
    c = (vB).uc[C_INDEX_MUNGE( UIMM ) & 0xf]; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = c; \
}

#define VSPLTH( vT, vB, UIMM ) \
{ \
    ushort s; \
    ulong i; \
    s = (vB).us[S_INDEX_MUNGE( UIMM ) & 0x7]; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = s; \
}

#define VSPLTW( vT, vB, UIMM ) \
{ \
    ulong i, l; \
    l = (vB).ul[L_INDEX_MUNGE( UIMM ) & 0x3]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = l; \
}

#define VSPLTISB( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = (char)(SIMM); \
}

#define VSPLTISH( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(SIMM); \
}

#define VSPLTISW( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(SIMM); \
}

#define VSUBFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a - b; \
        (vT).f[i] = c; \
    } \
}

#define VSUBSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] - (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}

#define VSUBSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] - (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}

#define VSUBSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] - (vB).l[i]; \
        if ( ((vA).l[i] >= 0) && ((vB).l[i] < 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] > 0) && (itemp > 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}
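
/*
 * A minimal sketch of signed saturating word subtraction that widens to 64
 * bits instead of inspecting operand signs after a wrapped subtraction.
 * This is an alternative formulation offered for comparison, not the
 * listing's own method; the name subs_sat32 is hypothetical.
 */

```c
#include <stdint.h>

/* a - b clamped to [INT32_MIN, INT32_MAX]; the 64-bit difference cannot overflow. */
static int32_t subs_sat32(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}
```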

#define VSUBUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
}

#define VSUBUBS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) { \
        if ( (vA).uc[i] <= (vB).uc[i] ) (vT).uc[i] = 0; \
        else (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
    } \
}

#define VSUBUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] - (vB).us[i]; \
}

#define VSUBUHS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).us[i] <= (vB).us[i] ) (vT).us[i] = 0; \
        else (vT).us[i] = (vA).us[i] - (vB).us[i]; \
    } \
}

#define VSUBUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
}

#define VSUBUWS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).ul[i] <= (vB).ul[i] ) (vT).ul[i] = 0; \
        else (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
    } \
}

#define VSUMSWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum; \
    sum = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 4; i++ ) \
        sum += (double)(vA).l[i]; \
    if ( sum > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    else if ( sum < -2147483648.0 ) /* lower bound must be negative */ \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum; \
}

#define VSUM2SWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum1, sum2; \
    sum1 = (double)(vB).l[L_INDEX_MUNGE( 1 )]; \
    sum2 = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 2; i++ ) { \
        sum1 += (double)(vA).l[L_INDEX_MUNGE( i )]; \
        sum2 += (double)(vA).l[L_INDEX_MUNGE( i+2 )]; \
    } \
    if ( sum1 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x7fffffff; \
    else if ( sum1 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 1 )] = (long)sum1; \
    if ( sum2 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    else if ( sum2 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum2; \
}

#define VSUM4SBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 4; j++ ) { \
            sum += (double)(vA).c[4*i + j]; \
        } \
        if ( sum > (double)(0x7fffffff) ) \
            (vT).l[i] = 0x7fffffff; \
        else if ( sum < -2147483648.0 ) \
            (vT).l[i] = 0x80000000; \
        else \
            (vT).l[i] = (long)sum; \
    } \
}

#define VSUM4SHS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            sum += (double)(vA).s[2*i + j]; \
        } \
        if ( sum > (double)(0x7fffffff) ) \
            (vT).l[i] = 0x7fffffff; \
        else if ( sum < -2147483648.0 ) \
            (vT).l[i] = 0x80000000; \
        else \
            (vT).l[i] = (long)sum; \
    } \
}

#define VSUM4UBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).ul[i]; \
        for ( j = 0; j < 4; j++ ) { \
            sum += (double)(vA).uc[4*i + j]; \
        } \
        if ( sum > (2.0 * (double)(0x7fffffff) + 1.0) ) \
            (vT).ul[i] = 0xffffffff; \
        else \
            (vT).ul[i] = (ulong)sum; \
    } \
}

#define VUPKHSB_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 7; i >= 0; i-- ) \
        (vT).s[i] = (short)(vB).c[i]; \
}

#define VUPKHSH_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 3; i >= 0; i-- ) \
        (vT).l[i] = (long)(vB).s[i]; \
}

#define VUPKLSB_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(vB).c[i+8]; \
}

#define VUPKLSH_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(vB).s[i+4]; \
}

#if defined( LITTLE_ENDIAN )
#define VUPKHSB( vT, vB ) VUPKLSB_BE( vT, vB );
#define VUPKHSH( vT, vB ) VUPKLSH_BE( vT, vB );
#define VUPKLSB( vT, vB ) VUPKHSB_BE( vT, vB );
#define VUPKLSH( vT, vB ) VUPKHSH_BE( vT, vB );
#else
#define VUPKHSB( vT, vB ) VUPKHSB_BE( vT, vB );
#define VUPKHSH( vT, vB ) VUPKHSH_BE( vT, vB );
#define VUPKLSB( vT, vB ) VUPKLSB_BE( vT, vB );
#define VUPKLSH( vT, vB ) VUPKLSH_BE( vT, vB );
#endif

#define VXOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] ^ (vB).ul[i]; \
}

#endif /* end BUILD_MAX */ 

/*
 * stack and register macros
 */
#define VRSAVE_COND 7 /* recommended VR condition bit */

/*
 * macros to save and restore the CR register
 */
#define SAVE_CR
#define REST_CR

/*
 * macros to save and restore the LR register
 */
#define SAVE_LR
#define REST_LR

/*
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 * For MAX only:
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 */
#define GET_GPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)gpr_save_area + 15) & ~15); /* round up to 16 bytes */

#define GET_FPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)fpr_save_area + 15) & ~15);

#if defined( BUILD_MAX )
#define GET_VR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)vr_save_area + 15) & ~15);
#endif
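
/*
 * A minimal sketch of rounding an address up to the next 16-byte boundary,
 * the operation performed by the save-area macros. The mask must clear all
 * four low bits, that is ~15 (equivalently -16). The name align_up16 and
 * the use of uintptr_t are assumptions made for this illustration.
 */

```c
#include <stdint.h>

/* Round addr up to a multiple of 16; already aligned values are unchanged. */
static uintptr_t align_up16(uintptr_t addr)
{
    return (addr + 15u) & ~(uintptr_t)15u;
}
```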

/*
 * macros to allocate and free space on the user stack.
 * For C implementation, the size is limited to 4096 bytes.
 */
#define PUSH_STACK( nbytes ) \
    sp = (long)(((ulong)stack + 15) & ~15);

#define POP_STACK( nbytes ) \
    sp = 0;

#define ALLOCATE_STACK_SPACE( ptr, nbytes ) \
    PUSH_STACK( nbytes ) \
    ptr = sp;

#define FREE_STACK_SPACE( nbytes ) POP_STACK( nbytes )

#define CREATE_STACK_FRAME( nbytes ) \
    PUSH_STACK( nbytes )

#define CREATE_STACK_FRAME_X( nbytes ) \
    CREATE_STACK_FRAME( nbytes )

#define DESTROY_STACK_FRAME \
    sp = 0;

#define CREATE_STACK_BUFFER( bufferp, byte_align, nbytes ) \
    ALLOCATE_STACK_SPACE( bufferp, nbytes )

#define CREATE_STACK_BUFFER_X( bufferp, byte_align, nbytes ) \
    CREATE_STACK_BUFFER( bufferp, byte_align, nbytes )

#define DESTROY_STACK_BUFFER \
    sp = 0;

/*
 * macros to create salcache from the stack, used in ucode only
 */
#define CREATE_STACK_SALCACHE \
    char localcachebuffer[SALCACHE_ALLOC_SIZE];

#define DESTROY_STACK_SALCACHE



/*
 * macros for saving and restoring non-volatile
 * floating point registers (FPRs)
 */

#define SAVE_f14
#define SAVE_f14_f15
#define SAVE_f14_f16
#define SAVE_f14_f17
#define SAVE_f14_f18
#define SAVE_f14_f19
#define SAVE_f14_f20
#define SAVE_f14_f21
#define SAVE_f14_f22
#define SAVE_f14_f23
#define SAVE_f14_f24
#define SAVE_f14_f25
#define SAVE_f14_f26
#define SAVE_f14_f27
#define SAVE_f14_f28
#define SAVE_f14_f29
#define SAVE_f14_f30
#define SAVE_f14_f31

#define SAVE_d14
#define SAVE_d14_d15
#define SAVE_d14_d16
#define SAVE_d14_d17
#define SAVE_d14_d18
#define SAVE_d14_d19
#define SAVE_d14_d20
#define SAVE_d14_d21
#define SAVE_d14_d22
#define SAVE_d14_d23
#define SAVE_d14_d24
#define SAVE_d14_d25
#define SAVE_d14_d26
#define SAVE_d14_d27
#define SAVE_d14_d28
#define SAVE_d14_d29
#define SAVE_d14_d30
#define SAVE_d14_d31

#define REST_f14
#define REST_f14_f15
#define REST_f14_f16
#define REST_f14_f17
#define REST_f14_f18
#define REST_f14_f19
#define REST_f14_f20
#define REST_f14_f21
#define REST_f14_f22
#define REST_f14_f23
#define REST_f14_f24
#define REST_f14_f25
#define REST_f14_f26
#define REST_f14_f27
#define REST_f14_f28
#define REST_f14_f29
#define REST_f14_f30
#define REST_f14_f31

#define REST_d14
#define REST_d14_d15
#define REST_d14_d16
#define REST_d14_d17
#define REST_d14_d18
#define REST_d14_d19
#define REST_d14_d20
#define REST_d14_d21
#define REST_d14_d22
#define REST_d14_d23
#define REST_d14_d24
#define REST_d14_d25
#define REST_d14_d26
#define REST_d14_d27
#define REST_d14_d28
#define REST_d14_d29
#define REST_d14_d30
#define REST_d14_d31


/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */

#define SAVE_r13
#define SAVE_r13_r14
#define SAVE_r13_r15
#define SAVE_r13_r16
#define SAVE_r13_r17
#define SAVE_r13_r18
#define SAVE_r13_r19
#define SAVE_r13_r20
#define SAVE_r13_r21
#define SAVE_r13_r22
#define SAVE_r13_r23
#define SAVE_r13_r24
#define SAVE_r13_r25
#define SAVE_r13_r26
#define SAVE_r13_r27
#define SAVE_r13_r28
#define SAVE_r13_r29
#define SAVE_r13_r30
#define SAVE_r13_r31

#define REST_r13
#define REST_r13_r14
#define REST_r13_r15
#define REST_r13_r16
#define REST_r13_r17
#define REST_r13_r18
#define REST_r13_r19
#define REST_r13_r20
#define REST_r13_r21
#define REST_r13_r22
#define REST_r13_r23
#define REST_r13_r24
#define REST_r13_r25
#define REST_r13_r26
#define REST_r13_r27
#define REST_r13_r28
#define REST_r13_r29
#define REST_r13_r30
#define REST_r13_r31

#define SAVE_r14
#define SAVE_r14_r15
#define SAVE_r14_r16
#define SAVE_r14_r17
#define SAVE_r14_r18
#define SAVE_r14_r19
#define SAVE_r14_r20
#define SAVE_r14_r21
#define SAVE_r14_r22
#define SAVE_r14_r23
#define SAVE_r14_r24
#define SAVE_r14_r25
#define SAVE_r14_r26
#define SAVE_r14_r27
#define SAVE_r14_r28
#define SAVE_r14_r29
#define SAVE_r14_r30
#define SAVE_r14_r31

#define REST_r14
#define REST_r14_r15
#define REST_r14_r16
#define REST_r14_r17
#define REST_r14_r18
#define REST_r14_r19
#define REST_r14_r20
#define REST_r14_r21
#define REST_r14_r22
#define REST_r14_r23
#define REST_r14_r24
#define REST_r14_r25
#define REST_r14_r26
#define REST_r14_r27
#define REST_r14_r28
#define REST_r14_r29
#define REST_r14_r30
#define REST_r14_r31

#define SAVE_r15
#define SAVE_r15_r16
#define SAVE_r15_r17
#define SAVE_r15_r18
#define SAVE_r15_r19
#define SAVE_r15_r20
#define SAVE_r15_r21
#define SAVE_r15_r22
#define SAVE_r15_r23
#define SAVE_r15_r24
#define SAVE_r15_r25
#define SAVE_r15_r26
#define SAVE_r15_r27
#define SAVE_r15_r28
#define SAVE_r15_r29
#define SAVE_r15_r30
#define SAVE_r15_r31

#define REST_r15
#define REST_r15_r16
#define REST_r15_r17
#define REST_r15_r18
#define REST_r15_r19
#define REST_r15_r20
#define REST_r15_r21
#define REST_r15_r22
#define REST_r15_r23
#define REST_r15_r24
#define REST_r15_r25
#define REST_r15_r26
#define REST_r15_r27
#define REST_r15_r28
#define REST_r15_r29
#define REST_r15_r30
#define REST_r15_r31

#define SAVE_r16
#define SAVE_r16_r17
#define SAVE_r16_r18
#define SAVE_r16_r19
#define SAVE_r16_r20
#define SAVE_r16_r21
#define SAVE_r16_r22
#define SAVE_r16_r23
#define SAVE_r16_r24
#define SAVE_r16_r25
#define SAVE_r16_r26
#define SAVE_r16_r27
#define SAVE_r16_r28
#define SAVE_r16_r29
#define SAVE_r16_r30
#define SAVE_r16_r31

#define REST_r16
#define REST_r16_r17
#define REST_r16_r18
#define REST_r16_r19
#define REST_r16_r20
#define REST_r16_r21
#define REST_r16_r22
#define REST_r16_r23
#define REST_r16_r24
#define REST_r16_r25
#define REST_r16_r26
#define REST_r16_r27
#define REST_r16_r28
#define REST_r16_r29
#define REST_r16_r30
#define REST_r16_r31



/* 

* VMX registers 
*/ 



#define USE_THRU_v0( cond )
#define USE_THRU_v1( cond )
#define USE_THRU_v2( cond )
#define USE_THRU_v3( cond )
#define USE_THRU_v4( cond )
#define USE_THRU_v5( cond )
#define USE_THRU_v6( cond )
#define USE_THRU_v7( cond )
#define USE_THRU_v8( cond )
#define USE_THRU_v9( cond )
#define USE_THRU_v10( cond )
#define USE_THRU_v11( cond )
#define USE_THRU_v12( cond )
#define USE_THRU_v13( cond )
#define USE_THRU_v14( cond )
#define USE_THRU_v15( cond )
#define USE_THRU_v16( cond )
#define USE_THRU_v17( cond )
#define USE_THRU_v18( cond )
#define USE_THRU_v19( cond )
#define USE_THRU_v20( cond )
#define USE_THRU_v21( cond )
#define USE_THRU_v22( cond )
#define USE_THRU_v23( cond )
#define USE_THRU_v24( cond )
#define USE_THRU_v25( cond )
#define USE_THRU_v26( cond )
#define USE_THRU_v27( cond )
#define USE_THRU_v28( cond )
#define USE_THRU_v29( cond )
#define USE_THRU_v30( cond )
#define USE_THRU_v31( cond )

#define FREE_THRU_v0( cond )
#define FREE_THRU_v1( cond )
#define FREE_THRU_v2( cond )
#define FREE_THRU_v3( cond )
#define FREE_THRU_v4( cond )
#define FREE_THRU_v5( cond )
#define FREE_THRU_v6( cond )
#define FREE_THRU_v7( cond )
#define FREE_THRU_v8( cond )
#define FREE_THRU_v9( cond )
#define FREE_THRU_v10( cond )
#define FREE_THRU_v11( cond )
#define FREE_THRU_v12( cond )
#define FREE_THRU_v13( cond )
#define FREE_THRU_v14( cond )
#define FREE_THRU_v15( cond )
#define FREE_THRU_v16( cond )
#define FREE_THRU_v17( cond )
#define FREE_THRU_v18( cond )
#define FREE_THRU_v19( cond )
#define FREE_THRU_v20( cond )
#define FREE_THRU_v21( cond )
#define FREE_THRU_v22( cond )
#define FREE_THRU_v23( cond )
#define FREE_THRU_v24( cond )
#define FREE_THRU_v25( cond )
#define FREE_THRU_v26( cond )
#define FREE_THRU_v27( cond )
#define FREE_THRU_v28( cond )
#define FREE_THRU_v29( cond )
#define FREE_THRU_v30( cond )
#define FREE_THRU_v31( cond )



#endif /* end SALPPC_H */ 

/*
 *
 * END OF FILE salppc.h
 *
 */

#if !defined( SALPPC_INC )
#define SALPPC_INC

#if 0 

 ***************************************************************************
 *   MC Standard Algorithms  PPC Version
 ***************************************************************************
 *
 *   File Name:    salppc.inc
 *   Description:  SAL macro include file
 *
 *   Source files should have extension .mac. For example, vadd.mac
 *   and must include this file (salppc.inc).
 *
 *   To assemble for PPC ucode, use the following basic
 *   makefile build rule:
 *
 *     .SUFFIXES: .mac .c .s .o
 *
 *     .mac.o:
 *         cp $*.mac $*.c
 *         ccmc -o $*.s -E $*.c
 *         ccmc -c -o $*.o $*.s
 *         rm -f $*.s
 *         rm -f $*.c
 *
 *   To compile for C, use the following basic makefile build rule:
 *
 *     .SUFFIXES: .mac .c .o
 *
 *     .mac.o:
 *         cp $*.mac $*.c
 *         ccmc -DCOMPILE_C -c -o $*.o $*.c
 *         rm -f $*.c
 *

 *   The first 8 function arguments are passed in GPR registers
 *   r3 - r10. Arguments beyond 8 are passed on the stack and may
 *   be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
 *   Additional GPR registers should be assigned in ascending order
 *   starting from the last function argument. These may be declared
 *   with the DECLARE_rx[_ry] macros. For example, a function with
 *   5 arguments that requires 3 additional GPR registers would
 *   issue: DECLARE_r8_r10. r0, if required, should be declared
 *   separately with the DECLARE_r0 macro. GPR registers above r12
 *   must be saved and restored using the SAVE_r13[_ry] and
 *   REST_r13[_ry] macros, respectively.
 *
 *   FPR registers should be assigned in ascending order starting
 *   with f0 [d0]. These may be declared with the DECLARE_f0[_fy]
 *   or DECLARE_d0[_dy] macros.
 *   For example, DECLARE_f0_f11. FPR registers above f13 [d13] must
 *   be saved and restored using the SAVE_f14[_fy] and REST_f14[_fy]
 *   or SAVE_d14[_dy] and REST_d14[_dy] macros, respectively.
 *
 *   All variables must be assigned a register using the
 *   pre-processor #define directive. GPR registers are named
 *   r0 - r31; Single precision FPR registers are named f0 - f31.
 *   Double precision FPR registers are named d0 - d31. Different
 *   variables may be assigned to the same register as in:
 *
 *     #define vara f12
 *     #define varb f12
 *
 *   Functions must begin with the FUNC_PROLOG macro and end
 *   with the FUNC_EPILOG macro.
 *
 *   Macros are provided for both Fortran and C entry points.
 *
 *   The GET_SALCACHE macro should be used to get the address of
 *   the "current" salcache buffer into a GPR register.
 *
 *   Avoid terminating macro lines with a semicolon.
 *
 *   The following example demonstrates typical usage:
 *
 *     #include "salppc.inc"
 *
 *     /*
 *      * assign variables to registers
 *      */
 *     #define A       r3
 *     #define I       r4
 *     #define B       r5
 *     #define J       r6
 *     #define C       r7
 *     #define K       r8
 *     #define D       r9
 *     #define L       r10
 *     #define N       r12
 *     #define EFLAG   r11
 *     #define count   r11
 *
 *     #define t0      r13
 *     #define t1      r13
 *     #define t2      r14
 *     #define t3      r14
 *     #define t4      r15
 *     #define t5      r15
 *     #define t6      r16
 *
 *     #define a0      f0
 *     #define a1      f1
 *     #define a2      f2
 *     #define a3      f3
 *     #define b0      f4
 *     #define b1      f5
 *     #define b2      f6
 *     #define b3      f7
 *     #define c0      f8
 *     #define c1      f9
 *     #define c2      f10
 *     #define c3      f11
 *     #define d0      f12
 *     #define d1      f13
 *     #define d2      f14
 *     #define d3      f15
 *
 *     FUNC_PROLOG                     /* must precede function */
 *
 *     #if !defined( COMPILE_C )
 *     U_ENTRY(foo_)
 *         FORTRAN_DREF_4(I, J, K, L)
 *         FORTRAN_DREF_ARG8
 *
 *     U_ENTRY(foo)
 *         LI(EFLAG, 0)
 *         BR(common)
 *
 *     U_ENTRY(foo_x_)
 *         FORTRAN_DREF_4(I, J, K, L)
 *         FORTRAN_DREF_ARG8
 *         FORTRAN_DREF_ARG9
 *     #endif



 *
 *     ENTRY_10(foo_x, A, I, B, J, C, K, D, L, N, EFLAG)
 *         DECLARE_r13_r16
 *         DECLARE_f0_f15
 *         GET_ARG9( EFLAG )           /* get the 9'th arg (EFLAG) off stack */
 *
 *     LABEL(common)
 *
 *         SAVE_CR                     /* needed if using fields 2,3 or 4 */
 *         SAVE_r13_r16
 *         SAVE_f14_f15
 *         SAVE_LR                     /* needed if making a function call */
 *
 *         GET_ARG8( N )               /* get the 8'th arg (N) off stack */
 *
 *         ... body of function ...
 *
 *         REST_CR
 *         REST_r13_r16
 *         REST_f14_f15
 *         REST_LR
 *         RETURN
 *
 *     FUNC_EPILOG                     /* must conclude function */
 *
 *   Mercury Computer Systems, Inc.
 *   Copyright (c) 1996 All rights reserved
 *
 *   Revision   Date     Engineer; Reason
 *   --------   ----     ----------------
 *     0.0      960223   jg;  Created
 *     0.1      970109   jfk; Added POSTING_BUFFER_COUNT and made
 *                            TEST_IF_DCBZ macro time "stw" instead
 *                            of doing the TEST_IF_DCBT macro (lwz)
 *     0.2      970124   jfk; Added SALCACHE_ALLOC_SIZE,
 *                            ALIGN_SALCACHE, CREATE_SALCACHE_FRAME
 *                            DESTROY_SALCACHE_FRAME
 *     0.3      970521   jfk; Added SET_DCB[TZ]_COND macros.
 *                            Made old macros not assemble
 *     0.4      980813   jfk; Changes SALCACHE_ALLOC_SIZE for 750
 *
 ***************************************************************************

#endif /* header */ 

#if !defined( BUILD_603 ) && !defined( BUILD_750 ) && !defined( BUILD_MAX )
#error You must define BUILD_603 or BUILD_750 or BUILD_MAX
#endif



/* 

* define single precision floating point field sizes, 

* limits, and values 
*/ 

#define F_FLOAT_SIZE 32
#define F_FRAC_SIZE 23
#define F_HIDDEN_SIZE 1
#define F_EXP_SIZE 8
#define F_SIGN_SIZE 1

#define F_SIGN_BIT (F_FLOAT_SIZE - F_SIGN_SIZE)
#define F_EXP_MASK ((1 << F_EXP_SIZE) - 1)
#define F_EXP_BIAS ((1 << (F_EXP_SIZE-1)) - 1)
#define F_MAX_EXP F_EXP_BIAS
#define F_MIN_EXP (-(F_EXP_BIAS-1))

/* 

* define double precision floating point field sizes, 

* limits, and values 
*/ 

#define D_FLOAT_SIZE 64
#define D_FRAC_SIZE 52
#define D_HIDDEN_SIZE 1
#define D_EXP_SIZE 11
#define D_SIGN_SIZE 1

#define D_SIGN_BIT (D_FLOAT_SIZE - D_SIGN_SIZE)
#define D_EXP_MASK ((1 << D_EXP_SIZE) - 1)
#define D_EXP_BIAS ((1 << (D_EXP_SIZE-1)) - 1)
#define D_MAX_EXP D_EXP_BIAS
#define D_MIN_EXP (-(D_EXP_BIAS-1))
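The field-size constants above describe the sign, exponent and fraction layout of a floating point word. As a rough illustration of how such constants can be used, the sketch below pulls the sign and unbiased exponent out of a single precision value. It assumes the host `float` is an IEEE 754 binary32; the helper names `float_bits`, `float_sign` and `float_unbiased_exp` are hypothetical and not part of the listing.

```c
#include <stdint.h>
#include <string.h>

/* Single precision field constants, restated from the listing. */
#define F_FLOAT_SIZE 32
#define F_FRAC_SIZE  23
#define F_EXP_SIZE   8
#define F_SIGN_SIZE  1
#define F_SIGN_BIT   (F_FLOAT_SIZE - F_SIGN_SIZE)
#define F_EXP_MASK   ((1 << F_EXP_SIZE) - 1)
#define F_EXP_BIAS   ((1 << (F_EXP_SIZE - 1)) - 1)

/* Copy the raw bit pattern of a float into an integer (assumes binary32). */
static uint32_t float_bits(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* 1 for negative values, 0 otherwise. */
static unsigned float_sign(float x)
{
    return float_bits(x) >> F_SIGN_BIT;
}

/* Exponent field with the bias removed (valid for normalized values). */
static int float_unbiased_exp(float x)
{
    return (int)((float_bits(x) >> F_FRAC_SIZE) & F_EXP_MASK) - F_EXP_BIAS;
}
```

The same pattern would apply to the double precision constants with a 64-bit integer in place of `uint32_t`.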

#if defined( BUILD_603 )

#define LOG2_CACHE_SIZE (14) /* Log (base 2) of 603 data cache */

#elif defined( BUILD_750 ) || defined( BUILD_MAX )

#define LOG2_CACHE_SIZE (15) /* Log (base 2) of 750 or MAX data cache */

#endif



#define LOG2_CACHE_BSIZE (LOG2_CACHE_SIZE)
#define LOG2_CACHE_HSIZE (LOG2_CACHE_SIZE - 1)
#define LOG2_CACHE_LSIZE (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_FSIZE (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_DSIZE (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_CSIZE (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_ZSIZE (LOG2_CACHE_SIZE - 4)

#define CACHE_SIZE (1 << LOG2_CACHE_SIZE)
#define CACHE_BSIZE (CACHE_SIZE)
#define CACHE_HSIZE (CACHE_SIZE >> 1)
#define CACHE_LSIZE (CACHE_SIZE >> 2)
#define CACHE_FSIZE (CACHE_SIZE >> 2)
#define CACHE_DSIZE (CACHE_SIZE >> 3)
#define CACHE_CSIZE (CACHE_SIZE >> 3)
#define CACHE_ZSIZE (CACHE_SIZE >> 4)

#define LOG2_CACHE_LINE_SIZE 5
#define CACHE_LINE_SIZE (1 << LOG2_CACHE_LINE_SIZE)
#define CACHE_LINE_LSIZE (CACHE_LINE_SIZE >> 2)
#define CACHE_LINE_MASK (CACHE_LINE_SIZE - 1)
#define CACHE_LINE_ADDR_MASK (0xffffffe0)

#define LOG2_SALCACHE_ALIGN 6
#define SALCACHE_ALIGN (1 << LOG2_SALCACHE_ALIGN)
#define SALCACHE_ALIGN_MASK (SALCACHE_ALIGN - 1)

#define SALCACHE_SIZE CACHE_SIZE
#define SALCACHE_EXTRA_SIZE (SALCACHE_ALIGN + 64)
#define SALCACHE_ALLOC_SIZE (SALCACHE_SIZE + SALCACHE_EXTRA_SIZE)
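The cache equates above express the data cache size in bytes and in element counts, and describe a 32-byte cache line. The sketch below shows, under stated assumptions, how such constants might be used to split an address into a line base and an offset and to count the lines a buffer touches. It assumes the `BUILD_603` value of `LOG2_CACHE_SIZE`, 32-bit addresses, and that the `F` and `D` suffixes mean 4-byte and 8-byte elements; the three helper functions are hypothetical.

```c
#include <stdint.h>

#define LOG2_CACHE_SIZE       14   /* assumption: the BUILD_603 value */
#define CACHE_SIZE            (1 << LOG2_CACHE_SIZE)
#define CACHE_FSIZE           (CACHE_SIZE >> 2)   /* assumed: 4-byte elements */
#define CACHE_DSIZE           (CACHE_SIZE >> 3)   /* assumed: 8-byte elements */
#define LOG2_CACHE_LINE_SIZE  5
#define CACHE_LINE_SIZE       (1 << LOG2_CACHE_LINE_SIZE)
#define CACHE_LINE_MASK       (CACHE_LINE_SIZE - 1)
#define CACHE_LINE_ADDR_MASK  (0xffffffe0)

/* Address of the first byte of the cache line containing addr. */
static uint32_t cache_line_base(uint32_t addr)
{
    return addr & CACHE_LINE_ADDR_MASK;
}

/* Byte offset of addr within its cache line. */
static uint32_t cache_line_offset(uint32_t addr)
{
    return addr & CACHE_LINE_MASK;
}

/* Number of cache lines touched by nbytes starting at addr (nbytes > 0). */
static uint32_t cache_lines_touched(uint32_t addr, uint32_t nbytes)
{
    return (cache_line_offset(addr) + nbytes + CACHE_LINE_MASK)
           >> LOG2_CACHE_LINE_SIZE;
}
```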



/*
 * Define memory vector non-cache (N) / cache (C) FLAG values for
 * Enhanced SAL calls (final argument). The letters in the symbol
 * correspond to the vectors in the call, moving from left to right
 * so, for example:
 * for VMULX, there are the following 8 possibilities:
 *
 * VMULX (A, I, B, J, C, K, N, SAL_NNN)   A, B, C all not in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NNC)   A, B not in cache, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCN)   A, C not in cache, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCC)   A not in cache, B, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNN)   B, C not in cache, A in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNC)   B not in cache, A, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCN)   C not in cache, A, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCC)   A, B, C all in cache
 */
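The numeric values assigned to these flag symbols in the listing are consistent with reading the N/C letters as a binary number, with the leftmost vector as the most significant bit (N = 0, C = 1). The sketch below illustrates that reading for a three-vector call; `sal_flag3` and `sal_vector_in_cache` are hypothetical helpers, not functions from the SAL library.

```c
/* Build a 3-vector flag from per-vector "in cache" booleans.
 * The leftmost vector (a) lands in the most significant bit. */
static int sal_flag3(int a_in_cache, int b_in_cache, int c_in_cache)
{
    return ((a_in_cache != 0) << 2)
         | ((b_in_cache != 0) << 1)
         |  (c_in_cache != 0);
}

/* Test whether vector number index (0 = leftmost) of an nvectors-wide
 * flag is marked as being in cache. */
static int sal_vector_in_cache(int flag, int nvectors, int index)
{
    return (flag >> (nvectors - 1 - index)) & 1;
}
```

Under this reading, `sal_flag3(1, 0, 0)` gives 4, which matches the value the listing assigns to the C-N-N case.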



/* 

* 1 vector algorithms 
*/ 

#define SAL_N 0
#define SAL_C 1

/*
 * 2 vector algorithms
 */

#define SAL_NN 0
#define SAL_NC 1
#define SAL_CN 2
#define SAL_CC 3



/*
 * 3 vector algorithms
 */

#define SAL_NNN 0
#define SAL_NNC 1
#define SAL_NCN 2
#define SAL_NCC 3
#define SAL_CNN 4
#define SAL_CNC 5
#define SAL_CCN 6
#define SAL_CCC 7

/*
 * 4 vector algorithms
 */

#define SAL_NNNN 0
#define SAL_NNNC 1
#define SAL_NNCN 2
#define SAL_NNCC 3
#define SAL_NCNN 4
#define SAL_NCNC 5
#define SAL_NCCN 6
#define SAL_NCCC 7
#define SAL_CNNN 8
#define SAL_CNNC 9
#define SAL_CNCN 10
#define SAL_CNCC 11
#define SAL_CCNN 12
#define SAL_CCNC 13
#define SAL_CCCN 14
#define SAL_CCCC 15

/*
 * 5 vector algorithms
 */

#define SAL_NNNNN 0
#define SAL_NNNNC 1
#define SAL_NNNCN 2
#define SAL_NNNCC 3
#define SAL_NNCNN 4
#define SAL_NNCNC 5
#define SAL_NNCCN 6
#define SAL_NNCCC 7
#define SAL_NCNNN 8
#define SAL_NCNNC 9
#define SAL_NCNCN 10
#define SAL_NCNCC 11
#define SAL_NCCNN 12
#define SAL_NCCNC 13
#define SAL_NCCCN 14
#define SAL_NCCCC 15
#define SAL_CNNNN 16
#define SAL_CNNNC 17
#define SAL_CNNCN 18
#define SAL_CNNCC 19
#define SAL_CNCNN 20
#define SAL_CNCNC 21
#define SAL_CNCCN 22
#define SAL_CNCCC 23
#define SAL_CCNNN 24
#define SAL_CCNNC 25
#define SAL_CCNCN 26
#define SAL_CCNCC 27
#define SAL_CCCNN 28
#define SAL_CCCNC 29
#define SAL_CCCCN 30
#define SAL_CCCCC 31


/*
 * define byte offsets into FFT_setup_ppc603e
 */

#define FFT_SETUP_HANDLE 0
#define FFT_SETUP_SMALL_TWIDP 4
#define FFT_SETUP_SMALL_BITR_TWIDP 8
#define FFT_SETUP_SMALL_LOG2M 12
#define FFT_SETUP_BIG_TWIDP 16
#define FFT_SETUP_BIG_XY_TWIDP 20
#define FFT_SETUP_BIG_LOG2MXY 24
#define FFT_SETUP_BIG_LOG2X 28
#define FFT_SETUP_BIG_LOG2Y 32
#define FFT_SETUP_BIG_STRIPX 36
#define FFT_SETUP_RPASS_TWIDP 40
#define FFT_SETUP_RADIX3_TWIDP 44
#define FFT_SETUP_RADIX5_TWIDP 48
#define FFT_SETUP_LOG2M 52
#define FFT_SETUP_LOG2MR 56
#define FFT_SETUP_VMX_BITR_TWIDP 60
#define FFT_SETUP_VMX_TABLES 64



/*
 * ASIC equates
 */

#define ASIC_H -1024 /* (0xFBFF + 1) */

#define PREFETCH_CONTROL (0xFBFFFE00)
#define PREFETCH_CONTROL_H -1024 /* (0xFBFF + 1) */
#define PREFETCH_CONTROL_L -512 /* (0xFE00) */

#define MISCON_B (0xFBFFFC18)
#define MISCON_B_H -1024 /* (0xFBFF + 1) */
#define MISCON_B_L -1000 /* (0xFC18) */

#define PREFETCH_DISABLED 0
#define PREFETCH_AUTO_6 1
#define PREFETCH_AUTO_5 2
#define PREFETCH_AUTO_4 3
#define PREFETCH_AUTO_3 4
#define PREFETCH_AUTO_2 5
#define PREFETCH_AUTO_1 6
#define PREFETCH_AUTO_0 7

#define PREFETCH_MANUAL_0 8
#define PREFETCH_MANUAL_2 9
#define PREFETCH_MANUAL_4 10
#define PREFETCH_MANUAL_6 11
#define PREFETCH_MANUAL_8 12
#define PREFETCH_MANUAL_10 13
#define PREFETCH_MANUAL_12 14
#define PREFETCH_MANUAL_14 15

#define USE_PREFETCH_CONTROL 16
#define USE_MISCON_B 0

#define PREFETCH_MASK 15

#define PREFETCH_DEFAULT (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_OFF (USE_PREFETCH_CONTROL | PREFETCH_DISABLED)
#define PREFETCH_A6 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_6)
#define PREFETCH_A5 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_5)
#define PREFETCH_A4 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_4)
#define PREFETCH_A3 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_3)
#define PREFETCH_A2 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_2)
#define PREFETCH_A1 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_1)
#define PREFETCH_A0 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_0)

#define PREFETCH_M0 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_M2 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_2)
#define PREFETCH_M4 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_4)
#define PREFETCH_M6 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_6)
#define PREFETCH_M8 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_8)
#define PREFETCH_M10 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_10)
#define PREFETCH_M12 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_12)
#define PREFETCH_M14 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_14)

/*
 * macro to compile for PPC assembly (COMPILE_C *not* defined) or
 * C code (COMPILE_C defined)
 */
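The ASIC equates pair each 32-bit register address with a signed high half and a signed low half (for example 0xFBFFFE00 with -1024 and -512). These look like the halves one would use when a 32-bit constant is built from two 16-bit immediates and the low half is sign-extended, so that the high half has to be bumped by one whenever the low half is negative. That reading is an assumption, and the helper names below are hypothetical; the sketch only checks the arithmetic.

```c
#include <stdint.h>

/* Low 16 bits of addr, interpreted as a signed 16-bit quantity. */
static int32_t addr_low(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xffffu);
    return (lo >= 0x8000) ? lo - 0x10000 : lo;
}

/* High 16 bits of addr, adjusted (+1) when the low half is negative,
 * then interpreted as a signed 16-bit quantity. */
static int32_t addr_high_adjusted(uint32_t addr)
{
    int32_t hi = (int32_t)(((addr + 0x8000u) >> 16) & 0xffffu);
    return (hi >= 0x8000) ? hi - 0x10000 : hi;
}

/* Recombine the two halves: (hi << 16) + sign-extended lo, modulo 2^32. */
static uint32_t addr_rebuild(int32_t hi, int32_t lo)
{
    return ((uint32_t)hi << 16) + (uint32_t)lo;
}
```

With these definitions, 0xFBFFFE00 splits into -1024 and -512, and 0xFBFFFC18 splits into -1024 and -1000, the same values that appear in the equates.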

#if defined ( COMPILE_C ) 
# include "salppc.h" 
#else 



/*
 * GPR register equates
 */

#define r0 0
#define sp 1
#define rtoc 2
#define r3 3
#define r4 4
#define r5 5
#define r6 6
#define r7 7
#define r8 8
#define r9 9
#define r10 10
#define r11 11
#define r12 12
#define r13 13
#define r14 14
#define r15 15
#define r16 16
#define r17 17
#define r18 18
#define r19 19
#define r20 20
#define r21 21
#define r22 22
#define r23 23
#define r24 24
#define r25 25
#define r26 26
#define r27 27
#define r28 28
#define r29 29
#define r30 30
#define r31 31

/* 

* FPR single precision register equates 
*/ 



#define f0 0
#define f1 1
#define f2 2
#define f3 3
#define f4 4
#define f5 5
#define f6 6
#define f7 7
#define f8 8
#define f9 9
#define f10 10
#define f11 11
#define f12 12
#define f13 13
#define f14 14
#define f15 15
#define f16 16
#define f17 17
#define f18 18
#define f19 19
#define f20 20
#define f21 21
#define f22 22
#define f23 23
#define f24 24
#define f25 25
#define f26 26
#define f27 27
#define f28 28
#define f29 29
#define f30 30
#define f31 31



/* 

* FPR double precision register equates 
*/ 

#define d0 0
#define d1 1
#define d2 2
#define d3 3
#define d4 4
#define d5 5
#define d6 6
#define d7 7
#define d8 8
#define d9 9
#define d10 10
#define d11 11
#define d12 12
#define d13 13
#define d14 14
#define d15 15
#define d16 16
#define d17 17
#define d18 18
#define d19 19
#define d20 20
#define d21 21
#define d22 22
#define d23 23
#define d24 24
#define d25 25
#define d26 26
#define d27 27
#define d28 28
#define d29 29
#define d30 30
#define d31 31



#if defined( BUILD_MAX )

/*
 * VMX (g4) register equates
 */

#define v0 0
#define v1 1
#define v2 2
#define v3 3
#define v4 4
#define v5 5
#define v6 6
#define v7 7
#define v8 8
#define v9 9
#define v10 10
#define v11 11
#define v12 12
#define v13 13
#define v14 14
#define v15 15
#define v16 16
#define v17 17
#define v18 18
#define v19 19
#define v20 20
#define v21 21
#define v22 22
#define v23 23
#define v24 24
#define v25 25
#define v26 26
#define v27 27
#define v28 28
#define v29 29
#define v30 30
#define v31 31

#endif

#define FUNC_PROLOG \
.section .text; \
.align 5;

#define FUNC_EPILOG

#define TEXT_SECTION( logb2_align ) \
.section .text; \
.align logb2_align;

#define DATA_SECTION( logb2_align ) \
.section .data; \
.align logb2_align;

#define RODATA_SECTION( logb2_align ) \
.section .rodata; \
.align logb2_align;

#define PC_OFFSET( nbytes ) (. + (nbytes) )
/* 

* make a "double" concat to fool the preprocessor so that input 

* arguments get translated before concatenation; otherwise, the 

* concatenated symbol doesn't get translated properly 
*/ 

#define CONCAT( left, right ) CONCAT_NEST( left, right )
#define CONCAT_NEST( left, right ) left##right
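The comment above explains that a two-level concatenation is needed so that macro arguments are expanded before they are pasted together. The short sketch below contrasts the two-level form with a direct `##`; `PREFIX`, `DIRECT`, `demo_value` and `PREFIXvalue` are hypothetical names used only for illustration.

```c
/* Two-level concatenation, as in the listing. */
#define CONCAT( left, right ) CONCAT_NEST( left, right )
#define CONCAT_NEST( left, right ) left##right

/* Hypothetical macro whose expansion we want pasted. */
#define PREFIX demo_

/* Hypothetical one-level concatenation, for comparison. */
#define DIRECT( left, right ) left##right

/* PREFIX is expanded first, so this defines the variable demo_value. */
static int CONCAT( PREFIX, value ) = 42;

/* PREFIX is pasted unexpanded, so this defines the variable PREFIXvalue. */
static int DIRECT( PREFIX, value ) = 7;
```

Arguments that are operands of `##` are not macro-expanded, which is why the extra level of indirection is required.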

/* 

* macro for extern declarations and definitions 
*/ 

#define EXTERN_DATA( symbol ) 
#define EXTERN_FUNC( func ) 
/* 

* macro for a global declaration 
*/ 

#define GLOBAL( symbol ) \
.globl symbol

/*
 * macro for a local declaration
 */

#define LOCAL( symbol )
/* 

 * macros for creating static arrays
 */

#define START_ARRAY( name ) \
name##:

#define START_C_ARRAY( name ) START_ARRAY( name )
#define START_UC_ARRAY( name ) START_ARRAY( name )
#define START_S_ARRAY( name ) START_ARRAY( name )
#define START_US_ARRAY( name ) START_ARRAY( name )
#define START_L_ARRAY( name ) START_ARRAY( name )
#define START_UL_ARRAY( name ) START_ARRAY( name )
#define START_F_ARRAY( name ) START_ARRAY( name )

#define END_ARRAY

#define DATA( type, d1 ) \
.##type d1

#define DATA2( type, d1, d2 ) \
.##type d1, d2

#define DATA4( type, d1, d2, d3, d4 ) \
.##type d1, d2, d3, d4

#define DATA8( type, d1, d2, d3, d4, d5, d6, d7, d8 ) \
.##type d1, d2, d3, d4, d5, d6, d7, d8

#define C_DATA( d1 ) DATA( byte, d1 )
#define UC_DATA( d1 ) DATA( byte, d1 )
#define S_DATA( d1 ) DATA( short, d1 )
#define US_DATA( d1 ) DATA( short, d1 )
#define L_DATA( d1 ) DATA( long, d1 )
#define UL_DATA( d1 ) DATA( long, d1 )
#define F_DATA( d1 ) DATA( float, d1 )

#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 ) DATA2( long, d2, d1 )
#else
#define D_DATA( d1, d2 ) DATA2( long, d1, d2 )
#endif

#define C_DATA2( d1, d2 ) DATA2( byte, d1, d2 )
#define UC_DATA2( d1, d2 ) DATA2( byte, d1, d2 )
#define S_DATA2( d1, d2 ) DATA2( short, d1, d2 )
#define US_DATA2( d1, d2 ) DATA2( short, d1, d2 )
#define L_DATA2( d1, d2 ) DATA2( long, d1, d2 )
#define UL_DATA2( d1, d2 ) DATA2( long, d1, d2 )
#define F_DATA2( d1, d2 ) DATA2( float, d1, d2 )

#define C_DATA4( d1, d2, d3, d4 ) DATA4( byte, d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 ) DATA4( byte, d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 ) DATA4( short, d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 ) DATA4( short, d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 ) DATA4( long, d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 ) DATA4( long, d1, d2, d3, d4 )
#define F_DATA4( d1, d2, d3, d4 ) DATA4( float, d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
DATA8( float, d1, d2, d3, d4, d5, d6, d7, d8 )

/*



/*
 * macros for creating vmx permute masks (128-bits)
 */

#if defined( LITTLE_ENDIAN )

#define L_PERMUTE_MUNGE( l ) ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s ) ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c ) ( (c) ^ 0x1f )
#define L_INDEX_MUNGE( x )   ( (x) ^ 0x3 )
#define S_INDEX_MUNGE( x )   ( (x) ^ 0x7 )
#define C_INDEX_MUNGE( x )   ( (x) ^ 0xf )

#else

#define L_PERMUTE_MUNGE( l ) ( l )
#define S_PERMUTE_MUNGE( s ) ( s )
#define C_PERMUTE_MUNGE( c ) ( c )
#define L_INDEX_MUNGE( x )   ( x )
#define S_INDEX_MUNGE( x )   ( x )
#define C_INDEX_MUNGE( x )   ( x )

#endif
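
/*
 * Illustrative sketch (not part of the original listing): a C model of
 * the little-endian "munge" macros above, assuming the XOR constants
 * read 0x1c1c1c1c, 0x3, 0x7 and 0xf. The function names are
 * hypothetical. XOR with an all-ones constant reflects an in-range
 * index, e.g. x ^ 0xf equals 15 - x for x in 0..15.
 */

```c
#include <stdint.h>

/* Hypothetical C equivalents of the little-endian index munge macros. */
unsigned l_index_munge(unsigned x) { return x ^ 0x3u; } /* 4 words  */
unsigned s_index_munge(unsigned x) { return x ^ 0x7u; } /* 8 shorts */
unsigned c_index_munge(unsigned x) { return x ^ 0xfu; } /* 16 bytes */

/* Munge four packed permute byte indices at once (one 32-bit word). */
uint32_t l_permute_munge(uint32_t l) { return l ^ 0x1c1c1c1cu; }
```

/*
 * Each function is its own inverse, so applying the munge twice gives
 * back the original index.
 */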

#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
    .long L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
          L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 )

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
    .short S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
           S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
           S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
           S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 )

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
    .byte C_PERMUTE_MUNGE( c1 ), C_PERMUTE_MUNGE( c2 ), \
          C_PERMUTE_MUNGE( c3 ), C_PERMUTE_MUNGE( c4 ), \
          C_PERMUTE_MUNGE( c5 ), C_PERMUTE_MUNGE( c6 ), \
          C_PERMUTE_MUNGE( c7 ), C_PERMUTE_MUNGE( c8 ), \
          C_PERMUTE_MUNGE( c9 ), C_PERMUTE_MUNGE( c10 ), \
          C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
          C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
          C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 )



/*
 * macro for a microcode entry point (e.g. vaddx, vaddx )
 * U_ENTRY is a "nop" for C code
 */

#define U_ENTRY( func_name ) \
    .globl func_name; \
func_name:



/*
 * macros for C function prototypes
 */

#define C_PROTOTYPE_0( func_name )
#define C_PROTOTYPE_1( func_name )
#define C_PROTOTYPE_2( func_name )
#define C_PROTOTYPE_3( func_name )
#define C_PROTOTYPE_4( func_name )
#define C_PROTOTYPE_5( func_name )
#define C_PROTOTYPE_6( func_name )
#define C_PROTOTYPE_7( func_name )
#define C_PROTOTYPE_8( func_name )
#define C_PROTOTYPE_9( func_name )
#define C_PROTOTYPE_10( func_name )
#define C_PROTOTYPE_11( func_name )
#define C_PROTOTYPE_12( func_name )
#define C_PROTOTYPE_13( func_name )
#define C_PROTOTYPE_14( func_name )
#define C_PROTOTYPE_15( func_name )
#define C_PROTOTYPE_16( func_name )



/*
 * macros for C and Fortran callable entry points
 */

#define ENTRY_0( func_name ) \
    .globl func_name; \
func_name:

#define ENTRY_1( func_name, arg0 ) \
    .globl func_name; \
func_name:

#define ENTRY_2( func_name, arg0, arg1 ) \
    .globl func_name; \
func_name:

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
    .globl func_name; \
func_name:

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
    .globl func_name; \
func_name:

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    .globl func_name; \
func_name:

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    .globl func_name; \
func_name:

#define ENTRY_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6 ) \
    .globl func_name; \
func_name:

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7 ) \
    .globl func_name; \
func_name:

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7, arg8 ) \
    .globl func_name; \
func_name:

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9 ) \
    .globl func_name; \
func_name:

#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10 ) \
    .globl func_name; \
func_name:

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11 ) \
    .globl func_name; \
func_name:

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12 ) \
    .globl func_name; \
func_name:

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13 ) \
    .globl func_name; \
func_name:

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14 ) \
    .globl func_name; \
func_name:

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14, arg15 ) \
    .globl func_name; \
func_name:

/*
 * macros to de-reference any set of the first 8 arguments
 * passed by reference to the Fortran entry point but by
 * value to the corresponding C entry point
 */

#define FORTRAN_DREF_1( arg0 ) \
    lwz arg0, 0(arg0);

#define FORTRAN_DREF_2( arg0, arg1 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1);

#define FORTRAN_DREF_3( arg0, arg1, arg2 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2);

#define FORTRAN_DREF_4( arg0, arg1, arg2, arg3 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3);

#define FORTRAN_DREF_5( arg0, arg1, arg2, arg3, arg4 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4);

#define FORTRAN_DREF_6( arg0, arg1, arg2, arg3, arg4, arg5 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5);

#define FORTRAN_DREF_7( arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6);

#define FORTRAN_DREF_8( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6); \
    lwz arg7, 0(arg7);

/*
 * macros to de-reference specific arguments beyond the first 8
 * passed by value to the C entry point
 */

#define ARG_OFF (8 - 8*4)

#define FORTRAN_DREF_ARG8 \
    lwz r12, (ARG_OFF + 8*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 8*4)(sp);

#define FORTRAN_DREF_ARG9 \
    lwz r12, (ARG_OFF + 9*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 9*4)(sp);

#define FORTRAN_DREF_ARG10 \
    lwz r12, (ARG_OFF + 10*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 10*4)(sp);

#define FORTRAN_DREF_ARG11 \
    lwz r12, (ARG_OFF + 11*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 11*4)(sp);

#define FORTRAN_DREF_ARG12 \
    lwz r12, (ARG_OFF + 12*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 12*4)(sp);

#define FORTRAN_DREF_ARG13 \
    lwz r12, (ARG_OFF + 13*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 13*4)(sp);

#define FORTRAN_DREF_ARG14 \
    lwz r12, (ARG_OFF + 14*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 14*4)(sp);

#define FORTRAN_DREF_ARG15 \
    lwz r12, (ARG_OFF + 15*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 15*4)(sp);

#define FORTRAN_DREF_ARG16 \
    lwz r12, (ARG_OFF + 16*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 16*4)(sp);

#define FORTRAN_DREF_ARG17 \
    lwz r12, (ARG_OFF + 17*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 17*4)(sp);
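
/*
 * Illustrative sketch (not part of the original listing): the stack
 * offset arithmetic used by the macros above, written as C. The helper
 * name arg_stack_offset is hypothetical. With ARG_OFF defined as
 * (8 - 8*4), argument 8 lands at 8(sp) and each later argument sits one
 * 4-byte word further on.
 */

```c
#include <assert.h>

enum { ARG_OFF = 8 - 8 * 4 };   /* same expression as the listing */

/* Byte offset from sp of argument k, for arguments beyond the first 8. */
int arg_stack_offset(int k)
{
    assert(k >= 8);             /* arguments 0..7 arrive in registers */
    return ARG_OFF + k * 4;
}
```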



/*
 * macros to get GPR arguments beyond 8
 */

#define GET_ARG8( rD )  lwz rD, (ARG_OFF + 8*4)(sp);
#define GET_ARG9( rD )  lwz rD, (ARG_OFF + 9*4)(sp);
#define GET_ARG10( rD ) lwz rD, (ARG_OFF + 10*4)(sp);
#define GET_ARG11( rD ) lwz rD, (ARG_OFF + 11*4)(sp);
#define GET_ARG12( rD ) lwz rD, (ARG_OFF + 12*4)(sp);
#define GET_ARG13( rD ) lwz rD, (ARG_OFF + 13*4)(sp);
#define GET_ARG14( rD ) lwz rD, (ARG_OFF + 14*4)(sp);
#define GET_ARG15( rD ) lwz rD, (ARG_OFF + 15*4)(sp);
#define GET_ARG16( rD ) lwz rD, (ARG_OFF + 16*4)(sp);
#define GET_ARG17( rD ) lwz rD, (ARG_OFF + 17*4)(sp);

/*
 * macros to set GPR arguments beyond 8
 */

#define SET_ARG8( rD )  stw rD, (ARG_OFF + 8*4)(sp);
#define SET_ARG9( rD )  stw rD, (ARG_OFF + 9*4)(sp);
#define SET_ARG10( rD ) stw rD, (ARG_OFF + 10*4)(sp);
#define SET_ARG11( rD ) stw rD, (ARG_OFF + 11*4)(sp);
#define SET_ARG12( rD ) stw rD, (ARG_OFF + 12*4)(sp);
#define SET_ARG13( rD ) stw rD, (ARG_OFF + 13*4)(sp);
#define SET_ARG14( rD ) stw rD, (ARG_OFF + 14*4)(sp);
#define SET_ARG15( rD ) stw rD, (ARG_OFF + 15*4)(sp);
#define SET_ARG16( rD ) stw rD, (ARG_OFF + 16*4)(sp);
#define SET_ARG17( rD ) stw rD, (ARG_OFF + 17*4)(sp);

/*
 * macro to branch from one entry point to another
 */

#define BR_FUNC( func_name ) \
    b func_name;

/*
 * macros to call functions
 */

#define CALL_FUNC( func_name ) \
    bl func_name;

#define CALL_0( func_name ) \
    CALL_FUNC( func_name )

#define CALL_1( func_name, arg0 ) \
    CALL_FUNC( func_name )

#define CALL_2( func_name, arg0, arg1 ) \
    CALL_FUNC( func_name )

#define CALL_3( func_name, arg0, arg1, arg2 ) \
    CALL_FUNC( func_name )

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
    CALL_FUNC( func_name )

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    CALL_FUNC( func_name )

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    CALL_FUNC( func_name )

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    CALL_FUNC( func_name )

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    CALL_FUNC( func_name )

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                arg8 ) \
    CALL_FUNC( func_name )

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9 ) \
    CALL_FUNC( func_name )

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10 ) \
    CALL_FUNC( func_name )

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11 ) \
    CALL_FUNC( func_name )

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12 ) \
    CALL_FUNC( func_name )

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13 ) \
    CALL_FUNC( func_name )

#define CALL_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14 ) \
    CALL_FUNC( func_name )

#define CALL_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 ) \
    CALL_FUNC( func_name )

#if defined( BUILD_MAX )

#if defined( COMPILE_ESAL_JUMP_TABLE )

/*
 * G4 macros to create an ESAL jump table for 1, 2, 3 and 4 vector
 * algorithms. The table name is <root_name>_jump and is made a
 * local symbol. (not supported in C)
 */

#define DECLARE_VMX_V1( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _n ); \
    .long CONCAT( root_name, _c );

#define DECLARE_VMX_V2( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nn ); \
    .long CONCAT( root_name, _nc ); \
    .long CONCAT( root_name, _cn ); \
    .long CONCAT( root_name, _cc );

#define DECLARE_VMX_V3( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnn ); \
    .long CONCAT( root_name, _nnc ); \
    .long CONCAT( root_name, _ncn ); \
    .long CONCAT( root_name, _ncc ); \
    .long CONCAT( root_name, _cnn ); \
    .long CONCAT( root_name, _cnc ); \
    .long CONCAT( root_name, _ccn ); \
    .long CONCAT( root_name, _ccc );

#define DECLARE_VMX_V4( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnn ); \
    .long CONCAT( root_name, _nnnc ); \
    .long CONCAT( root_name, _nncn ); \
    .long CONCAT( root_name, _nncc ); \
    .long CONCAT( root_name, _ncnn ); \
    .long CONCAT( root_name, _ncnc ); \
    .long CONCAT( root_name, _nccn ); \
    .long CONCAT( root_name, _nccc ); \
    .long CONCAT( root_name, _cnnn ); \
    .long CONCAT( root_name, _cnnc ); \
    .long CONCAT( root_name, _cncn ); \
    .long CONCAT( root_name, _cncc ); \
    .long CONCAT( root_name, _ccnn ); \
    .long CONCAT( root_name, _ccnc ); \
    .long CONCAT( root_name, _cccn ); \
    .long CONCAT( root_name, _cccc );





#define DECLARE_VMX_V5( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnnn ); \
    .long CONCAT( root_name, _nnnnc ); \
    .long CONCAT( root_name, _nnncn ); \
    .long CONCAT( root_name, _nnncc ); \
    .long CONCAT( root_name, _nncnn ); \
    .long CONCAT( root_name, _nncnc ); \
    .long CONCAT( root_name, _nnccn ); \
    .long CONCAT( root_name, _nnccc ); \
    .long CONCAT( root_name, _ncnnn ); \
    .long CONCAT( root_name, _ncnnc ); \
    .long CONCAT( root_name, _ncncn ); \
    .long CONCAT( root_name, _ncncc ); \
    .long CONCAT( root_name, _nccnn ); \
    .long CONCAT( root_name, _nccnc ); \
    .long CONCAT( root_name, _ncccn ); \
    .long CONCAT( root_name, _ncccc ); \
    .long CONCAT( root_name, _cnnnn ); \
    .long CONCAT( root_name, _cnnnc ); \
    .long CONCAT( root_name, _cnncn ); \
    .long CONCAT( root_name, _cnncc ); \
    .long CONCAT( root_name, _cncnn ); \
    .long CONCAT( root_name, _cncnc ); \
    .long CONCAT( root_name, _cnccn ); \
    .long CONCAT( root_name, _cnccc ); \
    .long CONCAT( root_name, _ccnnn ); \
    .long CONCAT( root_name, _ccnnc ); \
    .long CONCAT( root_name, _ccncn ); \
    .long CONCAT( root_name, _ccncc ); \
    .long CONCAT( root_name, _cccnn ); \
    .long CONCAT( root_name, _cccnc ); \
    .long CONCAT( root_name, _ccccn ); \
    .long CONCAT( root_name, _ccccc );

#define DECLARE_VMX_Z1( root_name ) DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_Z2( root_name ) DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_Z3( root_name ) DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_Z4( root_name ) DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_Z5( root_name ) DECLARE_VMX_V5( root_name )



/*
 * G4 macros to branch through the <root_name>_jump table based on
 * the value of the ESAL flag. (not supported in C)
 * (uses r0 as scratch and destroys eflag)
 * (not supported in C)
 */

#define BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp ) \
    addis rtemp, 0, CONCAT( root_name, _jump@ha ); \
    addi rtemp, rtemp, CONCAT( root_name, _jump@l ); \
    lwzx rtemp, rtemp, r0; \
    mtctr rtemp; \
    bctr;

#define BR_VMX_V1( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 29, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V2( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 28, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V3( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 27, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V4( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 26, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V5( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 25, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )
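
/*
 * Illustrative sketch (not part of the original listing): a C model of
 * how the rlwinm in the BR_VMX_Vn macros turns the ESAL flag into a
 * byte offset into a table of 4-byte entries. It assumes the operands
 * read "rlwinm r0, eflag, 2, 30-n, 29" for an n-vector macro (29 for
 * V1 down to 25 for V5). The function names are hypothetical; bit
 * numbering follows the PowerPC convention where bit 0 is the most
 * significant bit.
 */

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? (x << n) | (x >> (32u - n)) : x;
}

/* PowerPC-style mask with ones from bit mb through bit me (bit 0 = MSB). */
static uint32_t ppc_mask(unsigned mb, unsigned me)
{
    uint32_t from_mb = 0xFFFFFFFFu >> mb;          /* bits mb..31 */
    uint32_t upto_me = 0xFFFFFFFFu << (31u - me);  /* bits 0..me  */
    return (mb <= me) ? (from_mb & upto_me) : (from_mb | upto_me);
}

/* Model of rlwinm rA, rS, sh, mb, me: rotate left, then AND with mask. */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* Byte offset into the jump table for an nvec-vector macro (1..5). */
uint32_t jump_table_offset(uint32_t eflag, unsigned nvec)
{
    return rlwinm(eflag, 2, 30u - nvec, 29);
}
```

/*
 * In this model the low nvec bits of eflag select one of 2^nvec table
 * entries and the shift by 2 scales the index to a byte offset.
 */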





#else /* no ESAL jump table */

/*
 * G4 macros to create a dummy jump table.
 * (not supported in C)
 */

#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )

#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )

/*
 * G4 macros to simply branch to root_name (no jump table)
 * (not supported in C)
 */

#define BR_VMX_V1( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V2( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V3( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V4( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V5( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )



#endif 



/* end COMPILE_ESAL_JUMP_TABLE */ 



/*
 * G4 macros to decide whether to enter a VMX loop.
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note, a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (uses r0 as scratch)
 * (not supported in C)
 */

#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    andi. r0, p1, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:
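
/*
 * Illustrative sketch (not part of the original listing): the
 * two-vector entry test above expressed as a C predicate, following
 * the conditions stated in the comment (minimum count, unit strides,
 * same lower 4 address bits). The function name and the use of plain
 * 32-bit integers for the pointer registers are assumptions made for
 * illustration.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the VMX loop would be entered for two vectors. */
bool use_vmx_v2(uint32_t p1, int32_t s1, uint32_t p2, int32_t s2,
                uint32_t n, uint32_t min_n, int32_t unit_s)
{
    if (n < min_n)                      /* cmplwi / blt (unsigned)   */
        return false;
    if (s1 != unit_s || s2 != unit_s)   /* cmpwi / bne per stride    */
        return false;
    /* xor + andi. 0xf: both addresses share the same low 4 bits,
       i.e. the same offset within a 16-byte vector.                 */
    return ((p1 ^ p2) & 0xfu) == 0;
}
```

/*
 * The _ALIGNED variants of the macros use "or" instead of "xor", which
 * in the same model would require the low 4 bits of every pointer to
 * be zero rather than merely equal.
 */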

#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, ps, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, r0, ps; \
    bne v_skip_vmx; \
    andi. r0, r0, 0x6; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, pc, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    andi. r0, pc, 1; \
    bne v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 2; \
    bne v_skip_vmx; \
    xor r0, r0, pc; \
    andi. r0, r0, 0x3; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z1( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z2( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z4( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, pr5, pi5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z5( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_CONV( root_name, min_n_imm, \
                        p1, s1, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne v_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
                         pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, 1; \
    bne z_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne z_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

/*
 * G4 macro to get VMX unaligned word (FP) count
 * assumes that the last 2 bits of ptr are 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 30, 30, 31;

/*
 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 31, 29, 31;

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 0, 28, 31;
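
/*
 * Illustrative sketch (not part of the original listing): C equivalents
 * of the three unaligned-count macros above. Negating the address and
 * keeping the low bits gives the distance to the next 16-byte boundary;
 * the rotate converts that distance from bytes to words or shorts. The
 * function names are hypothetical and addresses are modelled as 32-bit
 * integers.
 */

```c
#include <stdint.h>

/* Number of 4-byte words before the next 16-byte boundary (0..3). */
uint32_t vmx_unaligned_count_w(uint32_t ptr)
{
    return ((0u - ptr) >> 2) & 3u;
}

/* Number of 2-byte shorts before the next 16-byte boundary (0..7). */
uint32_t vmx_unaligned_count_s(uint32_t ptr)
{
    return ((0u - ptr) >> 1) & 7u;
}

/* Number of bytes before the next 16-byte boundary (0..15). */
uint32_t vmx_unaligned_count_c(uint32_t ptr)
{
    return (0u - ptr) & 15u;
}
```

/*
 * For example, a word pointer at 0x1004 is 12 bytes (3 words) short of
 * the boundary at 0x1010, and an already aligned pointer yields 0.
 */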

/*
 * G4 macro to load and splat an FP scalar independent of alignment
 */

#if defined( LITTLE_ENDIAN )

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsr vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 3;

#else

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsl vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 0;

#endif

/*
 * G4 macro to construct an FP absolute value mask that can be used with
 * vand to take the absolute value of 4 FP numbers in a vector register
 * vt = 0x7fffffff7fffffff7fffffff7fffffff
 */

#define MAKE_VABS_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt; \
    vnor vt, vt, vt;

/*
 * G4 macro to construct an FP sign mask that can be used with:
 *     vandc to take the absolute value of
 *     vor to take the negative absolute value of
 *     vxor to negate
 * 4 FP numbers in a vector register
 * vt = 0x80000000800000008000000080000000
 */

#define MAKE_VSIGN_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt;
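
/*
 * Illustrative sketch (not part of the original listing): a scalar C
 * model of the three vector instructions used by the mask macros above,
 * showing why the sequences produce 0x80000000 and 0x7fffffff in each
 * word. The vec4 type and function names are hypothetical. vslw shifts
 * each word left by the low 5 bits of the matching word of the shift
 * operand, so shifting all-ones by itself shifts by 31.
 */

```c
#include <stdint.h>

typedef struct { uint32_t w[4]; } vec4;   /* four 32-bit lanes */

/* vspltisw: splat a sign-extended immediate into every word. */
static vec4 vspltisw(int32_t simm)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.w[i] = (uint32_t)simm;
    return r;
}

/* vslw: per-word shift left by the low 5 bits of b's word. */
static vec4 vslw(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.w[i] = a.w[i] << (b.w[i] & 31u);
    return r;
}

/* vnor: per-word bitwise NOR. */
static vec4 vnor(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < 4; i++) r.w[i] = ~(a.w[i] | b.w[i]);
    return r;
}

/* Model of MAKE_VSIGN_MASK: each word becomes 0x80000000. */
vec4 make_vsign_mask(void)
{
    vec4 vt = vspltisw(-1);
    return vslw(vt, vt);
}

/* Model of MAKE_VABS_MASK: each word becomes 0x7fffffff. */
vec4 make_vabs_mask(void)
{
    vec4 vt = make_vsign_mask();
    return vnor(vt, vt);
}
```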

/*
 * G4 macros to construct a coded touch stream control register
 * "I" indicates argument is passed as an immediate value
 * "R" indicates argument is passed in an integer register
 *
 * bytes_per_block = # of bytes in each block
 *     (0 = 512, 16, 32, ..., 480, 512)
 * block_count = # of blocks (0 = 256, 1, 2, 3, ... 256)
 * byte_stride = signed byte stride between start of adjacent blocks
 *     (-32768 <= byte_stride < 0; 0 = 32768; 0 < byte_stride < 32768)
 */

#define MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE( rB, bytes_per_block, block_count, byte_stride ) \
    MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride )

#define MAKE_STREAM_CODE_IIR( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_IRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_IRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RII( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RIR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    rlwimi rB, byte_stride, 0, 16, 31;
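All eight stream-code variants above produce the same 32-bit control word and differ only in whether each argument is an immediate or a register. The C sketch below packs the three fields the way the all-immediate variant does (a `lis` of the upper half followed by an `ori` of the stride); the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical C model of the stream control word:
 *   bits 28..24 : bytes_per_block / 16  (5 bits, 0 encodes 512)
 *   bits 23..16 : block_count           (8 bits, 0 encodes 256)
 *   bits 15..0  : byte_stride           (16 bits, two's complement)
 * Bit 0 is the least significant bit here. */
uint32_t make_stream_code(unsigned bytes_per_block,
                          unsigned block_count,
                          int byte_stride)
{
    uint32_t hi = ((((uint32_t)bytes_per_block >> 4) & 31u) << 8)
                | ((uint32_t)block_count & 255u);     /* lis operand */
    uint32_t lo = (uint32_t)byte_stride & 0x0000ffffu; /* ori operand */
    return (hi << 16) | lo;
}
```

For instance, four 32-byte blocks spaced 64 bytes apart encode as 0x02040040.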

#endif /* end BUILD_MAX */ 

#define CACHE_TB_THRESHOLD 1 /* 2 TB ticks = 12 CPU 100 MHz clks */

#define INSTRUCTION_CACHE_COUNT 3 /* min. to fully cache instructions */

#define POSTING_BUFFER_COUNT 10 /* min. to fill posting buffer */

/*
 * macros to set DCBx conditions explicitly
 */

#define DCBT_TRUE( cond_bit, scratch ) \
    li scratch, 0; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_TRUE( cond_bit, scratch ) \
    DCBT_TRUE( cond_bit, scratch )

#define DCBT_FALSE( cond_bit, scratch ) \
    li scratch, 2; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_FALSE( cond_bit, scratch ) \
    DCBT_FALSE( cond_bit, scratch )

/*
 * This macro will cause a file not to assemble.
 */

#define DO_NOT_ASSEMBLE add scratch1, scratch2, 256;

/*
 * Obsolete macro will cause assembler error
 */

#define TEST_IF_CACHABLE( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * Obsolete macro will cause assembler error
 */

#define TEST_IF_CACHABLE_ALIGN( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * macros to test if a DCBT or DCBZ instruction should be performed on
 * a particular buffer based on a bit test (cache bit) on a specified
 * ESAL flag.
 */

#define TEST_IF_DCBT( cond_bit, cache_bit, eflag, bufer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
    andi. scratch1, eflag, (cache_bit); \
    cmplwi (cond_bit), scratch1, 0;

/*
 * Set 2 dcbt conditions and ensure only one is true
 *
 * Ins. 1-3  Set both conditions to "No DCBT"
 * Ins. 4    See if vec1 has a C
 * Ins. 5    Set DCBT cond1
 * Ins. 6    Branch if "DCBT TRUE" (eflag & bit1 = 0)
 * Ins. 7-8  Set DCBT cond2
 */

#define SET_2_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0; \
    bc 12, ((cond1_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0;

/*
 * Set 3 dcbt conditions and ensure only one is true
 *
 * Logic is similar to the SET_2_DCBT_COND() macro
 */

#define SET_3_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

/*
 * Set 4 dcbt conditions and ensure only one is true
 * Logic is similar to the SET_2_DCBT_COND() macro
 */

#define SET_4_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, cond4_bit, cache_bit4, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    cmplwi (cond4_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit4); \
    cmplwi (cond4_bit), scratch, 0; \
    bc 12, ((cond4_bit)<<2)+2, PC_OFFSET( 36 ); \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

#if !defined COMPILE_NO_DCBZ

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 104 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 92 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 84 ); \
    addi tmp2, buffer, CACHE_LINE_SIZE; \
    li tmp3, CACHE_LINE_ADDR_MASK; \
    and tmp2, tmp2, tmp3; \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, tmp2; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 100 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 88 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 80 ); \
    andi. tmp3, buffer, CACHE_LINE_MASK; \
    bne PC_OFFSET( 72 ); \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, buffer; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#else /* COMPILE_NO_DCBZ is defined */

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#endif /* COMPILE_NO_DCBZ */ 

/*
 * macro to perform [or skip] a dcbt instruction based on the result
 * of a prior call to TEST_IF_DCBT (specifying the same condition bit).
 * dcbt is performed if the cond "<=" is true; otherwise dcbt is skipped.
 */

#define DCBT_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbt rA, rB;

/*
 * macro to perform [or skip] a dcbz instruction based on the result
 * of a prior call to TEST_IF_DCBZ (specifying the same condition bit).
 * dcbz is performed if the cond "<=" is true; otherwise dcbz is skipped.
 */

#if !defined COMPILE_NO_DCBZ

#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbz rA, rB;

#else

#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    nop;

#endif

/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was cachable (i.e. TB read time was <= CACHE_TB_THRESHOLD).
 */

#define BR_IF_COND_TRUE( cond_bit, label ) \
    bc 4, ((cond_bit)<<2)+1, label; /* <= */

/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was NOT cachable (i.e. TB read time was > CACHE_TB_THRESHOLD).
 */

#define BR_IF_COND_FALSE( cond_bit, label ) \
    bc 12, ((cond_bit)<<2)+1, label; /* > */

/* 

* ASIC macros 
*/ 

#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, PREFETCH_CONTROL_H; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, MISCON_B_H; \
    stw scratch1, MISCON_B_L( scratch2 );

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
    addis scratch2, 0, ASIC_H; \
    lwz scratch1, MISCON_B_L( scratch2 ); \
    andi. scratch1, scratch1, PREFETCH_MASK; \
    ori scratch1, scratch1, USE_PREFETCH_CONTROL; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )
#define LOAD_MISCON_B( mode, scratch1, scratch2 )
#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif



/*
 * instruction macros
 */

#define ADD( rD, rA, rB )               add rD, rA, rB;
#define ADD_C( rD, rA, rB )             add. rD, rA, rB;
#define ADDI( rD, rA, SIMM )            addi rD, rA, (SIMM);
#define ADDIC_C( rD, rA, SIMM )         addic. rD, rA, (SIMM);
#define ADDIS( rD, rA, SIMM )           addis rD, rA, (SIMM);
#define AND( rA, rS, rB )               and rA, rS, rB;
#define AND_C( rA, rS, rB )             and. rA, rS, rB;
#define ANDC( rA, rS, rB )              andc rA, rS, rB;
#define ANDC_C( rA, rS, rB )            andc. rA, rS, rB;
#define ANDI_C( rA, rS, UIMM )          andi. rA, rS, (UIMM);
#define ANDIS_C( rA, rS, UIMM )         andis. rA, rS, (UIMM);
#define BA( label )                     ba label;
#define BCTR                            bctr;
#define BCTRL                           bctrl;
#define BEQ( label )                    beq label;
#define BEQ_PLUS( label )               beq+ label;
#define BEQ_MINUS( label )              beq- label;
#define BEQ_CR( bit, label )            beq (bit), label;
#define BEQ_CR_PLUS( bit, label )       beq+ (bit), label;
#define BEQ_CR_MINUS( bit, label )      beq- (bit), label;
#define BEQLR                           beqlr;
#define BEQLR_PLUS                      beqlr+;
#define BEQLR_MINUS                     beqlr-;
#define BEQLR_CR( bit )                 beqlr (bit);
#define BEQLR_CR_PLUS( bit )            beqlr+ (bit);
#define BEQLR_CR_MINUS( bit )           beqlr- (bit);
#define BGE( label )                    bge label;
#define BGE_PLUS( label )               bge+ label;
#define BGE_MINUS( label )              bge- label;
#define BGE_CR( bit, label )            bge (bit), label;
#define BGE_CR_PLUS( bit, label )       bge+ (bit), label;
#define BGE_CR_MINUS( bit, label )      bge- (bit), label;
#define BGELR                           bgelr;
#define BGELR_PLUS                      bgelr+;
#define BGELR_MINUS                     bgelr-;
#define BGELR_CR( bit )                 bgelr (bit);
#define BGELR_CR_PLUS( bit )            bgelr+ (bit);
#define BGELR_CR_MINUS( bit )           bgelr- (bit);
#define BGT( label )                    bgt label;
#define BGT_PLUS( label )               bgt+ label;
#define BGT_MINUS( label )              bgt- label;
#define BGT_CR( bit, label )            bgt (bit), label;
#define BGT_CR_PLUS( bit, label )       bgt+ (bit), label;
#define BGT_CR_MINUS( bit, label )      bgt- (bit), label;
#define BGTLR                           bgtlr;
#define BGTLR_PLUS                      bgtlr+;
#define BGTLR_MINUS                     bgtlr-;
#define BGTLR_CR( bit )                 bgtlr (bit);
#define BGTLR_CR_PLUS( bit )            bgtlr+ (bit);
#define BGTLR_CR_MINUS( bit )           bgtlr- (bit);
#define BL( label )                     bl label;
#define BLE( label )                    ble label;
#define BLE_PLUS( label )               ble+ label;
#define BLE_MINUS( label )              ble- label;
#define BLE_CR( bit, label )            ble (bit), label;
#define BLE_CR_PLUS( bit, label )       ble+ (bit), label;
#define BLE_CR_MINUS( bit, label )      ble- (bit), label;
#define BLELR                           blelr;
#define BLELR_PLUS                      blelr+;
#define BLELR_MINUS                     blelr-;
#define BLELR_CR( bit )                 blelr (bit);
#define BLELR_CR_PLUS( bit )            blelr+ (bit);
#define BLELR_CR_MINUS( bit )           blelr- (bit);
#define BLR                             blr;
#define BLRL                            blrl;
#define BLT( label )                    blt label;
#define BLT_PLUS( label )               blt+ label;
#define BLT_MINUS( label )              blt- label;
#define BLT_CR( bit, label )            blt (bit), label;
#define BLT_CR_PLUS( bit, label )       blt+ (bit), label;
#define BLT_CR_MINUS( bit, label )      blt- (bit), label;
#define BLTLR                           bltlr;
#define BLTLR_PLUS                      bltlr+;
#define BLTLR_MINUS                     bltlr-;
#define BLTLR_CR( bit )                 bltlr (bit);
#define BLTLR_CR_PLUS( bit )            bltlr+ (bit);
#define BLTLR_CR_MINUS( bit )           bltlr- (bit);
#define BNE( label )                    bne label;
#define BNE_PLUS( label )               bne+ label;
#define BNE_MINUS( label )              bne- label;
#define BNE_CR( bit, label )            bne (bit), label;
#define BNE_CR_PLUS( bit, label )       bne+ (bit), label;
#define BNE_CR_MINUS( bit, label )      bne- (bit), label;
#define BNELR                           bnelr;
#define BNELR_PLUS                      bnelr+;
#define BNELR_MINUS                     bnelr-;
#define BNELR_CR( bit )                 bnelr (bit);
#define BNELR_CR_PLUS( bit )            bnelr+ (bit);
#define BNELR_CR_MINUS( bit )           bnelr- (bit);
#define BR( label )                     b label;
#define CLRLWI( rA, rS, nbits )         clrlwi rA, rS, (nbits);
#define CLRLWI_C( rA, rS, nbits )       clrlwi. rA, rS, (nbits);
#define CLRRWI( rA, rS, nbits )         clrrwi rA, rS, (nbits);
#define CLRRWI_C( rA, rS, nbits )       clrrwi. rA, rS, (nbits);
#define CMPLW( rA, rB )                 cmplw rA, rB;
#define CMPLW_CR( bit, rA, rB )         cmplw bit, rA, rB;
#define CMPLWI( rA, UIMM )              cmplwi rA, (UIMM);
#define CMPLWI_CR( bit, rA, UIMM )      cmplwi bit, rA, (UIMM);
#define CMPW( rA, rB )                  cmpw rA, rB;
#define CMPW_CR( bit, rA, rB )          cmpw bit, rA, rB;
#define CMPWI( rA, SIMM )               cmpwi rA, (SIMM);
#define CMPWI_CR( bit, rA, SIMM )       cmpwi bit, rA, (SIMM);
#define DCBF( rA, rB )                  dcbf rA, rB;
#define DCBI( rA, rB )                  dcbi rA, rB;
#define DCBST( rA, rB )                 dcbst rA, rB;
#define DCBT( rA, rB )                  dcbt rA, rB;
#define DCBTST( rA, rB )                dcbtst rA, rB;
#if !defined COMPILE_NO_DCBZ
#define DCBZ( rA, rB )                  dcbz rA, rB;
#else
#define DCBZ( rA, rB )                  nop;
#endif
#define DECR( rD )                      addi rD, rD, -1;
#define DECR_C( rD )                    addic. rD, rD, -1;
#define DIVW( rD, rA, rB )              divw rD, rA, rB;
#define DIVW_C( rD, rA, rB )            divw. rD, rA, rB;
#define DIVWU( rD, rA, rB )             divwu rD, rA, rB;
#define DIVWU_C( rD, rA, rB )           divwu. rD, rA, rB;
#define EQV( rA, rS, rB )               eqv rA, rS, rB;
#define EQV_C( rA, rS, rB )             eqv. rA, rS, rB;
#define EXTLWI( rA, rS, n, b )          rlwinm rA, rS, (b), 0, (n)-1;
#define EXTLWI_C( rA, rS, n, b )        rlwinm. rA, rS, (b), 0, (n)-1;
#define EXTRWI( rA, rS, n, b )          rlwinm rA, rS, (b)+(n), 32-(n), 31;
#define EXTRWI_C( rA, rS, n, b )        rlwinm. rA, rS, (b)+(n), 32-(n), 31;
#define FABS( frD, frB )                fabs frD, frB;
#define FADD( frD, frA, frB )           fadd frD, frA, frB;
#define FADDS( frD, frA, frB )          fadds frD, frA, frB;
#define FCMPO( bit, frA, frB )          fcmpo bit, frA, frB;
#define FCMPU( bit, frA, frB )          fcmpu bit, frA, frB;
#define FCTIW( frD, frB )               fctiw frD, frB;
#define FCTIWZ( frD, frB )              fctiwz frD, frB;
#define FDIV( frD, frA, frB )           fdiv frD, frA, frB;
#define FDIVS( frD, frA, frB )          fdivs frD, frA, frB;
#define FMADD( frD, frA, frC, frB )     fmadd frD, frA, frC, frB;
#define FMADDS( frD, frA, frC, frB )    fmadds frD, frA, frC, frB;
#define FMOV( frD, frB )                FMR( frD, frB )
#define FMR( frD, frB )                 fmr frD, frB;
#define FMUL( frD, frA, frB )           fmul frD, frA, frB;
#define FMULS( frD, frA, frB )          fmuls frD, frA, frB;
#define FMSUB( frD, frA, frC, frB )     fmsub frD, frA, frC, frB;
#define FMSUBS( frD, frA, frC, frB )    fmsubs frD, frA, frC, frB;
#define FNABS( frD, frB )               fnabs frD, frB;
#define FNEG( frD, frB )                fneg frD, frB;
#define FNMADD( frD, frA, frC, frB )    fnmadd frD, frA, frC, frB;
#define FNMADDS( frD, frA, frC, frB )   fnmadds frD, frA, frC, frB;
#define FNMSUB( frD, frA, frC, frB )    fnmsub frD, frA, frC, frB;
#define FNMSUBS( frD, frA, frC, frB )   fnmsubs frD, frA, frC, frB;
#define FRES( frD, frB )                fres frD, frB;
#define FRSP( frD, frB )                frsp frD, frB;
#define FRSQRTE( frD, frB )             frsqrte frD, frB;
#define FSEL( frD, frA, frC, frB )      fsel frD, frA, frC, frB;
#define FSUB( frD, frA, frB )           fsub frD, frA, frB;
#define FSUBS( frD, frA, frB )          fsubs frD, frA, frB;
#define GOTO( label )                   BR( label )
#define INCR( rD )                      addi rD, rD, 1;
#define INCR_C( rD )                    addic. rD, rD, 1;
#define INSLWI( rA, rS, n, b )          rlwimi rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSLWI_C( rA, rS, n, b )        rlwimi. rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSRWI( rA, rS, n, b )          rlwimi rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define INSRWI_C( rA, rS, n, b )        rlwimi. rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define LA( rD, symbol, SIMM )          addis rD, 0, (symbol+(SIMM))@ha; \
                                        addi rD, rD, (symbol+(SIMM))@l;
#define LABEL( label )                  label:
#define LBZ( rD, rA, d )                lbz rD, (d)(rA);
#define LBZA( rD, symbol )              addis rD, 0, (symbol)@ha; \
                                        lbz rD, (symbol)@l(rD);
#define LBZU( rD, rA, d )               lbzu rD, (d)(rA);
#define LBZUX( rD, rA, rB )             lbzux rD, rA, rB;
#define LBZX( rD, rA, rB )              lbzx rD, rA, rB;
#define LFD( frD, rA, d )               lfd frD, (d)(rA);
#define LFDU( frD, rA, d )              lfdu frD, (d)(rA);
#define LFDUX( frD, rA, rB )            lfdux frD, rA, rB;
#define LFDX( frD, rA, rB )             lfdx frD, rA, rB;
#define LFS( frD, rA, d )               lfs frD, (d)(rA);
#define LFSA( frD, symbol, rT )         addis rT, 0, (symbol)@ha; \
                                        lfs frD, (symbol)@l(rT);
#define LFSU( frD, rA, d )              lfsu frD, (d)(rA);
#define LFSUX( frD, rA, rB )            lfsux frD, rA, rB;
#define LFSX( frD, rA, rB )             lfsx frD, rA, rB;
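The EXTLWI, EXTRWI, INSLWI and INSRWI macros in this part of the listing expand to `rlwinm` and `rlwimi` with computed shift and mask operands. The C sketch below models those two instructions and the four expansions so the operand arithmetic can be checked on a host machine. It assumes the PowerPC bit numbering in which bit 0 is the most significant bit, non-wrapping masks (mb <= me), and n >= 1; all function names are hypothetical.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits (sh taken modulo 32). */
uint32_t rotl32(uint32_t x, unsigned sh)
{
    sh &= 31u;
    return sh ? (x << sh) | (x >> (32u - sh)) : x;
}

/* Ones from bit mb through bit me, bit 0 = MSB, assuming mb <= me. */
uint32_t ppc_mask(unsigned mb, unsigned me)
{
    return (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31u - me));
}

/* rlwinm: rotate left, then AND with mask. */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & ppc_mask(mb, me);
}

/* rlwimi: rotate left, then insert under mask into ra. */
uint32_t rlwimi(uint32_t ra, uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    uint32_t m = ppc_mask(mb, me);
    return (rotl32(rs, sh) & m) | (ra & ~m);
}

/* The four expansions, with operands computed as in the macros. */
uint32_t extlwi(uint32_t rs, unsigned n, unsigned b)
{
    return rlwinm(rs, b, 0, n - 1);
}

uint32_t extrwi(uint32_t rs, unsigned n, unsigned b)
{
    return rlwinm(rs, b + n, 32 - n, 31);
}

uint32_t inslwi(uint32_t ra, uint32_t rs, unsigned n, unsigned b)
{
    return rlwimi(ra, rs, 32 - b, b, b + n - 1);
}

uint32_t insrwi(uint32_t ra, uint32_t rs, unsigned n, unsigned b)
{
    return rlwimi(ra, rs, 32 - (b + n), b, b + n - 1);
}
```

With rS = 0x12345678, n = 8 and b = 8, the right-justified extract yields 0x34 and the left-justified extract yields 0x34000000.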
#define LHA( rD, rA, d )                lha rD, (d)(rA);
#define LHAA( rD, symbol )              addis rD, 0, (symbol)@ha; \
                                        lha rD, (symbol)@l(rD);
#define LHAU( rD, rA, d )               lhau rD, (d)(rA);
#define LHAUX( rD, rA, rB )             lhaux rD, rA, rB;
#define LHAX( rD, rA, rB )              lhax rD, rA, rB;
#define LHZ( rD, rA, d )                lhz rD, (d)(rA);
#define LHZA( rD, symbol )              addis rD, 0, (symbol)@ha; \
                                        lhz rD, (symbol)@l(rD);
#define LHZU( rD, rA, d )               lhzu rD, (d)(rA);
#define LHZUX( rD, rA, rB )             lhzux rD, rA, rB;
#define LHZX( rD, rA, rB )              lhzx rD, rA, rB;
#define LI( rD, SIMM )                  li rD, (SIMM);
#define LIS( rD, SIMM )                 lis rD, (SIMM);
#define LOAD_COUNT( rD )                mtctr rD;
#define LWZ( rD, rA, d )                lwz rD, (d)(rA);
#define LWZA( rD, symbol )              addis rD, 0, (symbol)@ha; \
                                        lwz rD, (symbol)@l(rD);
#define LWZU( rD, rA, d )               lwzu rD, (d)(rA);
#define LWZUX( rD, rA, rB )             lwzux rD, rA, rB;
#define LWZX( rD, rA, rB )              lwzx rD, rA, rB;
#define MCRF( crfD, crfS )              mcrf crfD, crfS;
#define MCRFS( crfD, crfS )             mcrfs crfD, crfS;
#define MFCR( rD )                      mfcr rD;
#define MFCTR( rD )                     mfctr rD;
#define MFLR( rD )                      mflr rD;
#define MFSPR( rD, SPR )                mfspr rD, SPR;
#define MR( rA, rS )                    mr rA, rS;
#define MR_C( rA, rS )                  or. rA, rS, rS;
#define MOV( rA, rS )                   MR( rA, rS )
#define MOV_C( rA, rS )                 MR_C( rA, rS )
#define MTCR( rD )                      mtcr rD;
#define MTCTR( rD )                     mtctr rD;
#define MTFSFI( crfD, IMM )             mtfsfi (crfD), (IMM);
#define MTLR( rD )                      mtlr rD;
#define MTSPR( SPR, rS )                mtspr SPR, rS;
#define MULLI( rD, rA, SIMM )           mulli rD, rA, (SIMM);
#define MULLW( rD, rA, rB )             mullw rD, rA, rB;
#define MULLW_C( rD, rA, rB )           mullw. rD, rA, rB;
#define NAND( rA, rS, rB )              nand rA, rS, rB;
#define NAND_C( rA, rS, rB )            nand. rA, rS, rB;
#define NEG( rD, rA )                   neg rD, rA;
#define NEG_C( rD, rA )                 neg. rD, rA;
#define NOP                             nop;
#define NOR( rA, rS, rB )               nor rA, rS, rB;
#define NOR_C( rA, rS, rB )             nor. rA, rS, rB;
#define OR( rA, rS, rB )                or rA, rS, rB;
#define OR_C( rA, rS, rB )              or. rA, rS, rB;
#define ORC( rA, rS, rB )               orc rA, rS, rB;
#define ORC_C( rA, rS, rB )             orc. rA, rS, rB;
#define ORI( rA, rS, UIMM )             ori rA, rS, (UIMM);
#define ORIS( rA, rS, UIMM )            oris rA, rS, (UIMM);
#define RETURN                          BLR
#define RLWIMI( rA, rS, SH, MB, ME )    rlwimi rA, rS, SH, MB, ME;
#define RLWIMI_C( rA, rS, SH, MB, ME )  rlwimi. rA, rS, SH, MB, ME;
#define RLWINM( rA, rS, SH, MB, ME )    rlwinm rA, rS, SH, MB, ME;
#define RLWINM_C( rA, rS, SH, MB, ME )  rlwinm. rA, rS, SH, MB, ME;
#define RLWNM( rA, rS, rB, MB, ME )     rlwnm rA, rS, rB, MB, ME;
#define RLWNM_C( rA, rS, rB, MB, ME )   rlwnm. rA, rS, rB, MB, ME;
#define ROTLW( rA, rS, rB )             rlwnm rA, rS, rB, 0, 31;
#define ROTLW_C( rA, rS, rB )           rlwnm. rA, rS, rB, 0, 31;
#define ROTLWI( rA, rS, n )             rlwinm rA, rS, (n), 0, 31;
#define ROTLWI_C( rA, rS, n )           rlwinm. rA, rS, (n), 0, 31;
#define ROTRWI( rA, rS, n )             rlwinm rA, rS, 32-(n), 0, 31;
#define ROTRWI_C( rA, rS, n )           rlwinm. rA, rS, 32-(n), 0, 31;
#define SLW( rA, rS, rB )               slw rA, rS, rB;
#define SLW_C( rA, rS, rB )             slw. rA, rS, rB;
#define SLWI( rA, rS, SH )              slwi rA, rS, (SH);
#define SLWI_C( rA, rS, SH )            slwi. rA, rS, (SH);
#define SRAW( rA, rS, rB )              sraw rA, rS, rB;
#define SRAW_C( rA, rS, rB )            sraw. rA, rS, rB;
#define SRAWI( rA, rS, SH )             srawi rA, rS, (SH);
#define SRAWI_C( rA, rS, SH )           srawi. rA, rS, (SH);
#define SRW( rA, rS, rB )               srw rA, rS, rB;
#define SRW_C( rA, rS, rB )             srw. rA, rS, rB;
#define SRWI( rA, rS, SH )              srwi rA, rS, (SH);
#define SRWI_C( rA, rS, SH )            srwi. rA, rS, (SH);
#define STB( rS, rA, d )                stb rS, (d)(rA);
#define STBU( rS, rA, d )               stbu rS, (d)(rA);
#define STBUX( rS, rA, rB )             stbux rS, rA, rB;
#define STBX( rS, rA, rB )              stbx rS, rA, rB;
#define STFD( frD, rA, d )              stfd frD, (d)(rA);
#define STFDU( frD, rA, d )             stfdu frD, (d)(rA);
#define STFDUX( frD, rA, rB )           stfdux frD, rA, rB;
#define STFDX( frD, rA, rB )            stfdx frD, rA, rB;
#define STFS( frD, rA, d )              stfs frD, (d)(rA);
#define STFSU( frD, rA, d )             stfsu frD, (d)(rA);
#define STFSUX( frD, rA, rB )           stfsux frD, rA, rB;
#define STFSX( frD, rA, rB )            stfsx frD, rA, rB;
#define STH( rS, rA, d )                sth rS, (d)(rA);
#define STHU( rS, rA, d )               sthu rS, (d)(rA);
#define STHUX( rS, rA, rB )             sthux rS, rA, rB;
#define STHX( rS, rA, rB )              sthx rS, rA, rB;
#define STW( rS, rA, d )                stw rS, (d)(rA);
#define STWU( rS, rA, d )               stwu rS, (d)(rA);
#define STWUX( rS, rA, rB )             stwux rS, rA, rB;
#define STWX( rS, rA, rB )              stwx rS, rA, rB;
#define SUB( rD, rA, rB )               sub rD, rA, rB;
#define SUB_C( rD, rA, rB )             sub. rD, rA, rB;
#define SUBFIC( rD, rA, SIMM )          subfic rD, rA, (SIMM);
#define SUBI( rD, rA, SIMM )            subi rD, rA, (SIMM);
#define SUBIC_C( rD, rA, SIMM )         subic. rD, rA, (SIMM);
#define SUBIS( rD, rA, SIMM )           subis rD, rA, (SIMM);
#define TEST_COUNT( label )             bdnz label;
#define XOR( rA, rS, rB )               xor rA, rS, rB;
#define XOR_C( rA, rS, rB )             xor. rA, rS, rB;
#define XORI( rA, rS, UIMM )            xori rA, rS, (UIMM);
#define XORIS( rA, rS, UIMM )           xoris rA, rS, (UIMM);

/*
 * VMX instructions
 */

#define BR_VMX_ALL_TRUE( label )        bt 24, label;
#define BR_VMX_ALL_FALSE( label )       bt 26, label;
#define BR_VMX_NONE_TRUE( label )       bt 26, label;
#define BR_VMX_SOME_FALSE( label )      bf 24, label;
#define BR_VMX_SOME_TRUE( label )       bf 26, label;
#define DSS( STRM )                     dss STRM, 0;
#define DSSALL                          dss 0, 1;
#define DST( rA, rB, STRM )             dst rA, rB, STRM;
#define DSTST( rA, rB, STRM )           dstst rA, rB, STRM;
#define DSTT( rA, rB, STRM )            dstt rA, rB, STRM;
#define DSTSTT( rA, rB, STRM )          dststt rA, rB, STRM;
#define LVEBX( vT, rA, rB )             lvebx vT, rA, rB;
#define LVEHX( vT, rA, rB )             lvehx vT, rA, rB;
#define LVEWX( vT, rA, rB )             lvewx vT, rA, rB;

#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB )              lvsr vT, rA, rB;
#define LVSR( vT, rA, rB )              lvsl vT, rA, rB;
#else
#define LVSL( vT, rA, rB )              lvsl vT, rA, rB;
#define LVSR( vT, rA, rB )              lvsr vT, rA, rB;
#endif
#define LVX( vT, rA, rB )               lvx vT, rA, rB;
#define LVXL( vT, rA, rB )              lvxl vT, rA, rB;
#define STVEBX( vS, rA, rB )            stvebx vS, rA, rB;
#define STVEHX( vS, rA, rB )            stvehx vS, rA, rB;
#define STVEWX( vS, rA, rB )            stvewx vS, rA, rB;
#define STVX( vS, rA, rB )              stvx vS, rA, rB;
#define STVXL( vS, rA, rB )             stvxl vS, rA, rB;
#define VADDFP( vT, vA, vB )            vaddfp vT, vA, vB;
#define VADDSBS( vT, vA, vB )           vaddsbs vT, vA, vB;
#define VADDSHS( vT, vA, vB )           vaddshs vT, vA, vB;
#define VADDSWS( vT, vA, vB )           vaddsws vT, vA, vB;
#define VADDUBM( vT, vA, vB )           vaddubm vT, vA, vB;
#define VADDUBS( vT, vA, vB )           vaddubs vT, vA, vB;
#define VADDUHM( vT, vA, vB )           vadduhm vT, vA, vB;
#define VADDUHS( vT, vA, vB )           vadduhs vT, vA, vB;
#define VADDUWM( vT, vA, vB )           vadduwm vT, vA, vB;
#define VADDUWS( vT, vA, vB )           vadduws vT, vA, vB;
#define VAND( vT, vA, vB )              vand vT, vA, vB;
#define VANDC( vT, vA, vB )             vandc vT, vA, vB;
#define VCMPEQFP( vT, vA, vB )          vcmpeqfp vT, vA, vB;
#define VCMPEQFP_C( vT, vA, vB )        vcmpeqfp. vT, vA, vB;
#define VCMPEQUB( vT, vA, vB )          vcmpequb vT, vA, vB;
#define VCMPEQUB_C( vT, vA, vB )        vcmpequb. vT, vA, vB;
#define VCMPEQUH( vT, vA, vB )          vcmpequh vT, vA, vB;
#define VCMPEQUH_C( vT, vA, vB )        vcmpequh. vT, vA, vB;
#define VCMPEQUW( vT, vA, vB )          vcmpequw vT, vA, vB;
#define VCMPEQUW_C( vT, vA, vB )        vcmpequw. vT, vA, vB;
#define VCMPGEFP( vT, vA, vB )          vcmpgefp vT, vA, vB;
#define VCMPGEFP_C( vT, vA, vB )        vcmpgefp. vT, vA, vB;
#define VCMPGTFP( vT, vA, vB )          vcmpgtfp vT, vA, vB;
#define VCMPGTFP_C( vT, vA, vB )        vcmpgtfp. vT, vA, vB;
#define VCMPGTSB( vT, vA, vB )          vcmpgtsb vT, vA, vB;
#define VCMPGTSB_C( vT, vA, vB )        vcmpgtsb. vT, vA, vB;
#define VCMPGTSH( vT, vA, vB )          vcmpgtsh vT, vA, vB;
#define VCMPGTSH_C( vT, vA, vB )        vcmpgtsh. vT, vA, vB;
#define VCMPGTSW( vT, vA, vB )          vcmpgtsw vT, vA, vB;
#define VCMPGTSW_C( vT, vA, vB )        vcmpgtsw. vT, vA, vB;
#define VCMPGTUB( vT, vA, vB )          vcmpgtub vT, vA, vB;
#define VCMPGTUB_C( vT, vA, vB )        vcmpgtub. vT, vA, vB;
#define VCMPGTUH( vT, vA, vB )          vcmpgtuh vT, vA, vB;
#define VCMPGTUH_C( vT, vA, vB )        vcmpgtuh. vT, vA, vB;
#define VCMPGTUW( vT, vA, vB )          vcmpgtuw vT, vA, vB;
#define VCMPGTUW_C( vT, vA, vB )        vcmpgtuw. vT, vA, vB;
#define VCFSX( vT, vB, UIMM )           vcfsx vT, vB, (UIMM);
#define VCFUX( vT, vB, UIMM )           vcfux vT, vB, (UIMM);
#define VCTSXS( vT, vB, UIMM )          vctsxs vT, vB, (UIMM);
#define VCTUXS( vT, vB, UIMM )          vctuxs vT, vB, (UIMM);
#define VEXPTEFP( vT, vB )              vexptefp vT, vB;
#define VLOGEFP( vT, vB )               vlogefp vT, vB;
#define VMADDFP( vT, vA, vC, vB )       vmaddfp vT, vA, vC, vB;
#define VMAXFP( vT, vA, vB )            vmaxfp vT, vA, vB;
#define VMAXSB( vT, vA, vB )            vmaxsb vT, vA, vB;
#define VMAXSH( vT, vA, vB )            vmaxsh vT, vA, vB;
#define VMAXSW( vT, vA, vB )            vmaxsw vT, vA, vB;
#define VMAXUB( vT, vA, vB )            vmaxub vT, vA, vB;
#define VMAXUH( vT, vA, vB )            vmaxuh vT, vA, vB;
#define VMAXUW( vT, vA, vB )            vmaxuw vT, vA, vB;
#define VMHADDSHS( vD, vA, vB, vC )     vmhaddshs vD, vA, vB, vC;
#define VMHRADDSHS( vD, vA, vB, vC )    vmhraddshs vD, vA, vB, vC;
#define VMINFP( vT, vA, vB )            vminfp vT, vA, vB;
#define VMINSB( vT, vA, vB )            vminsb vT, vA, vB;
#define VMINSH( vT, vA, vB )            vminsh vT, vA, vB;
#define VMINSW( vT, vA, vB )            vminsw vT, vA, vB;
#define VMINUB( vT, vA, vB )            vminub vT, vA, vB;
#define VMINUH( vT, vA, vB )            vminuh vT, vA, vB;
#define VMINUW( vT, vA, vB )            vminuw vT, vA, vB;
#define VMLADDUHM( vD, vA, vB, vC )     vmladduhm vD, vA, vB, vC;
#define VMR( vD, vS )                   vor vD, vS, vS;

#if defined( LITTLE_ENDIAN )
#define VMRGHB( vT, vA, vB )            vmrglb vT, vB, vA;
#define VMRGHH( vT, vA, vB )            vmrglh vT, vB, vA;
#define VMRGHW( vT, vA, vB )            vmrglw vT, vB, vA;
#define VMRGLB( vT, vA, vB )            vmrghb vT, vB, vA;
#define VMRGLH( vT, vA, vB )            vmrghh vT, vB, vA;
#define VMRGLW( vT, vA, vB )            vmrghw vT, vB, vA;
#else
#define VMRGHB( vT, vA, vB )            vmrghb vT, vA, vB;
#define VMRGHH( vT, vA, vB )            vmrghh vT, vA, vB;
#define VMRGHW( vT, vA, vB )            vmrghw vT, vA, vB;
#define VMRGLB( vT, vA, vB )            vmrglb vT, vA, vB;
#define VMRGLH( vT, vA, vB )            vmrglh vT, vA, vB;
#define VMRGLW( vT, vA, vB )            vmrglw vT, vA, vB;
#endif



#define VMSUMMBM( vT, vA, vB, vC )      vmsummbm vT, vA, vB, vC;
#define VMSUMSHM( vT, vA, vB, vC )      vmsumshm vT, vA, vB, vC;
#define VMSUMSHS( vT, vA, vB, vC )      vmsumshs vT, vA, vB, vC;
#define VMSUMUBM( vT, vA, vB, vC )      vmsumubm vT, vA, vB, vC;
#define VMSUMUHM( vT, vA, vB, vC )      vmsumuhm vT, vA, vB, vC;
#define VMSUMUHS( vT, vA, vB, vC )      vmsumuhs vT, vA, vB, vC;
#define VMULESB( vT, vA, vB )           vmulesb vT, vA, vB;
#define VMULESH( vT, vA, vB )           vmulesh vT, vA, vB;
#define VMULEUB( vT, vA, vB )           vmuleub vT, vA, vB;
#define VMULEUH( vT, vA, vB )           vmuleuh vT, vA, vB;
#define VMULOSB( vT, vA, vB )           vmulosb vT, vA, vB;
#define VMULOSH( vT, vA, vB )           vmulosh vT, vA, vB;
#define VMULOUB( vT, vA, vB )           vmuloub vT, vA, vB;
#define VMULOUH( vT, vA, vB )           vmulouh vT, vA, vB;
#define VNMSUBFP( vT, vA, vC, vB )      vnmsubfp vT, vA, vC, vB;
#define VNOR( vT, vA, vB )              vnor vT, vA, vB;
#define VNOT( vT, vA )                  vnor vT, vA, vA;
#define VOR( vT, vA, vB )               vor vT, vA, vB;

#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )         vperm vT, vB, vA, vC;
#define VPKUHUM( vT, vA, vB )           vpkuhum vT, vB, vA;
#define VPKUHUS( vT, vA, vB )           vpkuhus vT, vB, vA;
#define VPKSHUS( vT, vA, vB )           vpkshus vT, vB, vA;
#define VPKSHSS( vT, vA, vB )           vpkshss vT, vB, vA;
#define VPKUWUM( vT, vA, vB )           vpkuwum vT, vB, vA;
#define VPKUWUS( vT, vA, vB )           vpkuwus vT, vB, vA;
#define VPKSWUS( vT, vA, vB )           vpkswus vT, vB, vA;
#define VPKSWSS( vT, vA, vB )           vpkswss vT, vB, vA;
#else
#define VPERM( vT, vA, vB, vC )         vperm vT, vA, vB, vC;
#define VPKUHUM( vT, vA, vB )           vpkuhum vT, vA, vB;
#define VPKUHUS( vT, vA, vB )           vpkuhus vT, vA, vB;
#define VPKSHUS( vT, vA, vB )           vpkshus vT, vA, vB;
#define VPKSHSS( vT, vA, vB )           vpkshss vT, vA, vB;
#define VPKUWUM( vT, vA, vB )           vpkuwum vT, vA, vB;
#define VPKUWUS( vT, vA, vB )           vpkuwus vT, vA, vB;
#define VPKSWUS( vT, vA, vB )           vpkswus vT, vA, vB;
#define VPKSWSS( vT, vA, vB )           vpkswss vT, vA, vB;
#endif

#define VREFP( vT, vB )                 vrefp vT, vB;
#define VRFIM( vT, vB )                 vrfim vT, vB;
#define VRFIN( vT, vB )                 vrfin vT, vB;
#define VRFIP( vT, vB )                 vrfip vT, vB;
#define VRFIZ( vT, vB )                 vrfiz vT, vB;
#define VRLB( vT, vA, vB )              vrlb vT, vA, vB;
#define VRLH( vT, vA, vB )              vrlh vT, vA, vB;
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#define VRLW( vT, vA, vB ) 
#define VRSQRTEFP ( vT, vB ) 
#define VSEL ( vT, vA, vB, vC ) 
#define VSL( vT, vA, vB ) 

#if defined ( LITTLE__END IAN ) 
#define VSLDOI ( vT, vA, vB, UIMM ) 
#else 

#define VSLDOI ( vT, vA, vB, UIMM ) 
#endif 



3/9/2001 

vrlw vT, vA, vB; 
vrsqrtefp vT, vB; 
vsel vT, vA, vB, vC; 
vsl vT, vA, vB; 



vsldoi vT, vB, vA, (16 - (UIMM) ) ; 
vsldoi VT, vA, VB, (UIMM) ; 
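The little-endian form of VSLDOI swaps the two source operands and replaces the shift count with 16 - UIMM. The following is a minimal C sketch of why that substitution is consistent, under two assumptions that are not stated in the listing: that vsldoi selects 16 consecutive bytes starting at byte UIMM of the 32-byte concatenation of its two sources, and that the little-endian view of a register is its byte-reversed image. The helper names `sldoi`, `reverse16` and `sldoi_swap_matches` are illustrative only.

```c
#include <stdint.h>
#include <string.h>

/* Model of vsldoi: bytes sh .. sh+15 of the concatenation a || b (0 <= sh <= 16). */
static void sldoi(uint8_t out[16], const uint8_t a[16], const uint8_t b[16], int sh)
{
    uint8_t cat[32];
    memcpy(cat, a, 16);
    memcpy(cat + 16, b, 16);
    memcpy(out, cat + sh, 16);
}

/* Reverse the byte order of a 16-byte register image. */
static void reverse16(uint8_t out[16], const uint8_t in[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = in[15 - i];
}

/* Returns 1 if the operand-swapped form reproduces the byte-reversed result. */
int sldoi_swap_matches(const uint8_t a[16], const uint8_t b[16], int uimm)
{
    uint8_t direct[16], direct_rev[16], ra[16], rb[16], swapped[16];

    sldoi(direct, a, b, uimm);            /* vsldoi vT, vA, vB, UIMM      */
    reverse16(direct_rev, direct);

    reverse16(ra, a);
    reverse16(rb, b);
    sldoi(swapped, rb, ra, 16 - uimm);    /* vsldoi vT, vB, vA, 16 - UIMM */

    return memcmp(direct_rev, swapped, 16) == 0;
}
```

In this model the check holds for every shift count from 0 to 16; the real instruction encodes only a 4-bit immediate, so the boundary cases are a property of the sketch rather than of the hardware.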



#define VSLB( vT, vA, vB )          vslb vT, vA, vB;
#define VSLH( vT, vA, vB )          vslh vT, vA, vB;
#define VSLO( vT, vA, vB )          vslo vT, vA, vB;
#define VSLW( vT, vA, vB )          vslw vT, vA, vB;
#define VSR( vT, vA, vB )           vsr vT, vA, vB;
#define VSRAB( vT, vA, vB )         vsrab vT, vA, vB;
#define VSRAH( vT, vA, vB )         vsrah vT, vA, vB;
#define VSRAW( vT, vA, vB )         vsraw vT, vA, vB;
#define VSRB( vT, vA, vB )          vsrb vT, vA, vB;
#define VSRH( vT, vA, vB )          vsrh vT, vA, vB;
#define VSRO( vT, vA, vB )          vsro vT, vA, vB;
#define VSRW( vT, vA, vB )          vsrw vT, vA, vB;
#define VSPLTB( vT, vB, UIMM )      vspltb vT, vB, C_INDEX_MUNGE( UIMM );
#define VSPLTH( vT, vB, UIMM )      vsplth vT, vB, S_INDEX_MUNGE( UIMM );
#define VSPLTW( vT, vB, UIMM )      vspltw vT, vB, L_INDEX_MUNGE( UIMM );
#define VSPLTISB( vT, SIMM )        vspltisb vT, (SIMM);
#define VSPLTISH( vT, SIMM )        vspltish vT, (SIMM);
#define VSPLTISW( vT, SIMM )        vspltisw vT, (SIMM);
#define VSUBFP( vT, vA, vB )        vsubfp vT, vA, vB;
#define VSUBSBS( vT, vA, vB )       vsubsbs vT, vA, vB;
#define VSUBSHS( vT, vA, vB )       vsubshs vT, vA, vB;
#define VSUBSWS( vT, vA, vB )       vsubsws vT, vA, vB;
#define VSUBUBM( vT, vA, vB )       vsububm vT, vA, vB;
#define VSUBUBS( vT, vA, vB )       vsububs vT, vA, vB;
#define VSUBUHM( vT, vA, vB )       vsubuhm vT, vA, vB;
#define VSUBUHS( vT, vA, vB )       vsubuhs vT, vA, vB;
#define VSUBUWM( vT, vA, vB )       vsubuwm vT, vA, vB;
#define VSUBUWS( vT, vA, vB )       vsubuws vT, vA, vB;
#define VSUMSWS( vT, vA, vB )       vsumsws vT, vA, vB;
#define VSUM2SWS( vT, vA, vB )      vsum2sws vT, vA, vB;
#define VSUM4SBS( vT, vA, vB )      vsum4sbs vT, vA, vB;
#define VSUM4SHS( vT, vA, vB )      vsum4shs vT, vA, vB;
#define VSUM4UBS( vT, vA, vB )      vsum4ubs vT, vA, vB;

#if defined( LITTLE_ENDIAN )
#define VUPKHSB( vT, vB )           vupklsb vT, vB;
#define VUPKHSH( vT, vB )           vupklsh vT, vB;
#define VUPKLSB( vT, vB )           vupkhsb vT, vB;
#define VUPKLSH( vT, vB )           vupkhsh vT, vB;
#else
#define VUPKHSB( vT, vB )           vupkhsb vT, vB;
#define VUPKHSH( vT, vB )           vupkhsh vT, vB;
#define VUPKLSB( vT, vB )           vupklsb vT, vB;
#define VUPKLSH( vT, vB )           vupklsh vT, vB;
#endif

#define VXOR( vT, vA, vB )          vxor vT, vA, vB;



/*
 * stack and register macros
 */

#define VRSAVE_COND 7               /* recommended VR condition bit */
#undef VOLATILE_r13                 /* r13 volatile or non-volatile */

#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)

#define ALIGN_STACK( nbytes ) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)

#define LR_SAVE_OFF 4
#define FPR_SAVE_OFF (-(32-14)*8)

#if defined( VOLATILE_r13 )
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-14)*4)
#else
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-13)*4)
#endif

#define CR_SAVE_OFF (GPR_SAVE_OFF - 4)

#if defined( BUILD_MAX )
#define VRSAVE_SAVE_OFF (CR_SAVE_OFF - 4)
#if defined( VOLATILE_r13 )
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 0)
#else
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 12)
#endif
#define VR_SAVE_OFF (ALIGNMENT_PADDING_OFF - (32-20)*16)
#define LAST_OFF VR_SAVE_OFF
#else
#define LAST_OFF CR_SAVE_OFF
#endif

#define REG_SAVE_SIZE (-LAST_OFF)
#define MAX_NARGS 18
#define ARGS_SIZE (MAX_NARGS * 4)
#define LINK_SIZE 8

#define STACK_FRAME_SIZE (REG_SAVE_SIZE + ARGS_SIZE + LINK_SIZE)
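The offset chain above can be checked by evaluating it in ordinary C. The sketch below does so for one configuration only, assuming BUILD_MAX is defined and VOLATILE_r13 is not; the enumerator names mirror the macros in the listing, while `align_stack` and `print_frame_layout` are illustrative helpers that do not appear in the original.

```c
#include <stdio.h>

/* Offsets copied from the listing for BUILD_MAX with VOLATILE_r13 undefined. */
enum {
    MIN_STACK_ALIGN       = 16,
    MIN_STACK_ALIGN_MASK  = MIN_STACK_ALIGN - 1,
    FPR_SAVE_OFF          = -(32 - 14) * 8,
    GPR_SAVE_OFF          = FPR_SAVE_OFF - (32 - 13) * 4,
    CR_SAVE_OFF           = GPR_SAVE_OFF - 4,
    VRSAVE_SAVE_OFF       = CR_SAVE_OFF - 4,
    ALIGNMENT_PADDING_OFF = VRSAVE_SAVE_OFF - 12,
    VR_SAVE_OFF           = ALIGNMENT_PADDING_OFF - (32 - 20) * 16,
    REG_SAVE_SIZE         = -VR_SAVE_OFF,
    ARGS_SIZE             = 18 * 4,
    LINK_SIZE             = 8,
    STACK_FRAME_SIZE      = REG_SAVE_SIZE + ARGS_SIZE + LINK_SIZE
};

/* Same rounding as ALIGN_STACK: round nbytes up to a multiple of 16. */
static int align_stack(int nbytes)
{
    return (nbytes + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK;
}

/* Print the layout so the save areas can be inspected by eye. */
void print_frame_layout(void)
{
    printf("FPR_SAVE_OFF          = %d\n", FPR_SAVE_OFF);
    printf("GPR_SAVE_OFF          = %d\n", GPR_SAVE_OFF);
    printf("CR_SAVE_OFF           = %d\n", CR_SAVE_OFF);
    printf("VRSAVE_SAVE_OFF       = %d\n", VRSAVE_SAVE_OFF);
    printf("ALIGNMENT_PADDING_OFF = %d\n", ALIGNMENT_PADDING_OFF);
    printf("VR_SAVE_OFF           = %d\n", VR_SAVE_OFF);
    printf("STACK_FRAME_SIZE      = %d\n", STACK_FRAME_SIZE);
    printf("ALIGN_STACK(100)      = %d\n", align_stack(100));
}
```

In this configuration the 12 bytes of padding bring the vector save area to a 16-byte boundary, and the resulting frame size of 512 agrees with the "32768 - 512" limit quoted in the stack frame comments.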
/*
 * macros to obtain the byte offset into the stack for the last FPR
 * and GPR registers for small temporary storage.
 * FPR_SAVE_AREA_OFFSET points to an area of 8 * (# of unsaved non-volatile
 * FPR registers).
 * GPR_SAVE_AREA_OFFSET points to an area of 4 * (# of unsaved non-volatile
 * GPR registers).
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 * For MAX only:
 *
 * VR_SAVE_AREA_OFFSET points to an area of 16 * (# of unsaved non-volatile
 * VR registers).
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 */

#define FPR_SAVE_AREA_OFFSET FPR_SAVE_OFF
#define GPR_SAVE_AREA_OFFSET GPR_SAVE_OFF

#define GET_FPR_SAVE_AREA( ptr ) \
    addi ptr, sp, FPR_SAVE_AREA_OFFSET;

#define GET_GPR_SAVE_AREA( ptr ) \
    addi ptr, sp, GPR_SAVE_AREA_OFFSET;

#if defined( BUILD_MAX )
#define VR_SAVE_AREA_OFFSET VR_SAVE_OFF

#define GET_VR_SAVE_AREA( ptr ) \
    addi ptr, sp, VR_SAVE_AREA_OFFSET;
#endif

/*
 * if the function creates a stack frame with local storage,
 * LOCAL_STORAGE_OFFSET is the stack offset to the start of this
 * storage and is guaranteed to have the minimum stack alignment.
 */

#define LOCAL_STORAGE_OFFSET (LINK_SIZE + ARGS_SIZE) 
/*
 * macros to create and destroy a stack frame.
 *
 * CREATE_STACK_FRAME[_X] creates a stack frame that can handle up to
 * 18 GPR register arguments and a local storage size <=
 * 32768 - 512 = 32,256 bytes.
 *
 * CREATE_STACK_FRAME_X destroys r0.
 *
 * For CREATE_STACK_FRAME_X, local_nbytes_reg must not be r0.
 *
 * Both CREATE_STACK_FRAME[_X] and DESTROY_STACK_FRAME should not be
 * called before registers are saved or after they are restored.
 *
 * The stack pointer "output from" CREATE_STACK_FRAME[_X] must be
 * the same "input to" DESTROY_STACK_FRAME.
 */

#define CREATE_STACK_FRAME( local_nbytes ) \
    stwu sp, -ALIGN_STACK( STACK_FRAME_SIZE + (local_nbytes) )(sp);

#define CREATE_STACK_FRAME_X( local_nbytes_reg ) \
    addi r0, local_nbytes_reg, (STACK_FRAME_SIZE + MIN_STACK_ALIGN_SIZE); \
    andi. r0, r0, ~MIN_STACK_ALIGN_MASK; \
    stwux sp, sp, r0;

#define DESTROY_STACK_FRAME \
    lwz sp, 0(sp);
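CREATE_STACK_FRAME and DESTROY_STACK_FRAME rely on the back chain word that stwu leaves at offset 0 of the new frame. The following C sketch models that pairing on a small simulated memory; the array `mem`, its size, and the function names are assumptions made for illustration and are not part of the listing.

```c
#include <stdint.h>

#define MEM_WORDS 4096               /* hypothetical 16 KiB simulated stack */
static uint32_t mem[MEM_WORDS];      /* word-addressed backing store        */

/* Round up to a multiple of 16, as ALIGN_STACK does. */
static uint32_t align_stack(uint32_t nbytes)
{
    return (nbytes + 15u) & ~15u;
}

/* Model of: stwu sp, -ALIGN_STACK( frame_bytes )(sp) */
uint32_t create_stack_frame(uint32_t sp, uint32_t frame_bytes)
{
    uint32_t new_sp = sp - align_stack(frame_bytes);
    mem[new_sp / 4] = sp;            /* back chain word at offset 0 */
    return new_sp;
}

/* Model of: lwz sp, 0(sp) */
uint32_t destroy_stack_frame(uint32_t sp)
{
    return mem[sp / 4];
}
```

Because the old stack pointer is reloaded from memory rather than recomputed, the destroy step needs no knowledge of the frame size, which is why the listing requires only that the same stack pointer value be handed back.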

/*
 * macros to allocate and free space on the user stack.
 * with a fixed alignment of MIN_STACK_ALIGN.
 * nbytes must be <= (32768 - 432 = 32,336).
 * On return, sp points to a buffer of nbytes bytes.
 */

#define PUSH_STACK( nbytes ) \
    addi sp, sp, -ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define POP_STACK( nbytes ) \
    addi sp, sp, ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define ALLOCATE_STACK_SPACE( ptr, nbytes ) \
    PUSH_STACK( nbytes ) \
    mr ptr, sp;

#define FREE_STACK_SPACE( nbytes ) POP_STACK( nbytes )
/*
 * macros to create and destroy a stack buffer with a variable
 * alignment and size.
 * CREATE_STACK_BUFFER[_X] creates a buffer of size nbytes and alignment
 * byte_align on the stack, returning a pointer to the buffer in the
 * GPR bufferp.
 *
 * bufferp must be a GPR other than r0 and r1 (sp).
 * byte_align must be a power of 2 such that 2 <= byte_align <= 4096.
 * CREATE_STACK_BUFFER destroys r0.
 *
 * CREATE_STACK_BUFFER[_X] stores the original value of the stack pointer
 * below the buffer at offset 0 from the new stack pointer.
 *
 * DESTROY_STACK_BUFFER sets the stack pointer to the value stored
 * at the address pointed to by the input stack pointer.
 *
 * Both CREATE_STACK_BUFFER[_X] and DESTROY_STACK_BUFFER should not be
 * called before registers are saved or after they are restored.
 *
 * The stack pointer "output from" CREATE_STACK_BUFFER[_X] must be
 * the same "input to" DESTROY_STACK_BUFFER.
 */

#define CREATE_STACK_BUFFER( bufferp, byte_align, nbytes ) \
    addis bufferp, sp, (-(REG_SAVE_SIZE + (nbytes)) + 32768)@h; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, (-(REG_SAVE_SIZE + (nbytes)))@l; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;

#define CREATE_STACK_BUFFER_X( bufferp, byte_align, nbytes_reg ) \
    sub bufferp, sp, nbytes_reg; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, -REG_SAVE_SIZE; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;

#define DESTROY_STACK_BUFFER \
    lwz sp, 0(sp);
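The arithmetic performed by CREATE_STACK_BUFFER can be followed on plain integers. The sketch below is an illustrative C model, not the author's code: `create_stack_buffer` mirrors the instruction sequence step by step, and `split_hi_lo` shows why the addis operand adds 32768 before taking the high half (addi sign-extends its 16-bit immediate). The struct and function names are assumptions, and the register save size is passed in as a parameter rather than taken from REG_SAVE_SIZE.

```c
#include <stdint.h>

typedef struct {
    uint32_t bufferp;   /* aligned start of the buffer     */
    uint32_t new_sp;    /* stack pointer after the stwux   */
} stack_buffer;

/* Mirrors CREATE_STACK_BUFFER on plain integers (byte_align is a power of 2). */
stack_buffer create_stack_buffer(uint32_t sp, uint32_t byte_align,
                                 uint32_t nbytes, uint32_t reg_save_size)
{
    stack_buffer r;
    uint32_t mask = (byte_align - 1u) | 15u;    /* li r0, ((byte_align-1) | MIN_STACK_ALIGN_MASK) */
    r.bufferp = sp - (reg_save_size + nbytes);  /* addis/addi pair                                */
    r.bufferp &= ~mask;                         /* andc bufferp, bufferp, r0                      */
    r.new_sp = r.bufferp - 16u;                 /* sub, addic, stwux: sp + (bufferp - sp - 16)    */
    return r;
}

/* Split a 32-bit constant into the halves consumed by addis and addi. */
void split_hi_lo(int32_t value, int32_t *hi, int32_t *lo)
{
    int32_t low = value & 0xffff;
    if (low >= 0x8000)
        low -= 0x10000;                         /* addi sign-extends its immediate  */
    *lo = low;
    *hi = (int32_t)(((int64_t)value - low) / 65536);  /* equals (value + 32768) >> 16 */
}
```

The 16 bytes left between the new stack pointer and the buffer hold the saved stack pointer at offset 0, which is what DESTROY_STACK_BUFFER reloads.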

/*
 * macros to create and destroy the salcache buffer on the user stack.
 * CREATE_STACK_SALCACHE destroys r0.
 *
 * Both CREATE_STACK_SALCACHE and DESTROY_STACK_SALCACHE should not be
 * called before registers are saved or after they are restored.
 */

#define CREATE_STACK_SALCACHE( cachep ) \
    CREATE_STACK_BUFFER( cachep, SALCACHE_ALIGN, SALCACHE_ALLOC_SIZE )

#define DESTROY_STACK_SALCACHE DESTROY_STACK_BUFFER

/*
 * macros for saving and restoring non-volatile
 * floating point registers (FPRs)
 */

#define REST_f14_f18 SR_f14_f18( lfd )
#define REST_f14_f19 SR_f14_f19( lfd )
#define REST_f14_f20 SR_f14_f20( lfd )
#define REST_f14_f21 SR_f14_f21( lfd )
#define REST_f14_f22 SR_f14_f22( lfd )
#define REST_f14_f23 SR_f14_f23( lfd )
#define REST_f14_f24 SR_f14_f24( lfd )
#define REST_f14_f25 SR_f14_f25( lfd )
#define REST_f14_f26 SR_f14_f26( lfd )
#define REST_f14_f27 SR_f14_f27( lfd )
#define REST_f14_f28 SR_f14_f28( lfd )
#define REST_f14_f29 SR_f14_f29( lfd )
#define REST_f14_f30 SR_f14_f30( lfd )
#define REST_f14_f31 SR_f14_f31( lfd )
#define REST_d14     SR_f14( lfd )
#define REST_d14_d15 SR_f14_f15( lfd )
#define REST_d14_d16 SR_f14_f16( lfd )
#define REST_d14_d17 SR_f14_f17( lfd )
#define REST_d14_d18 SR_f14_f18( lfd )
#define REST_d14_d19 SR_f14_f19( lfd )
#define REST_d14_d20 SR_f14_f20( lfd )
#define REST_d14_d21 SR_f14_f21( lfd )
#define REST_d14_d22 SR_f14_f22( lfd )
#define REST_d14_d23 SR_f14_f23( lfd )
#define REST_d14_d24 SR_f14_f24( lfd )
#define REST_d14_d25 SR_f14_f25( lfd )
#define REST_d14_d26 SR_f14_f26( lfd )
#define REST_d14_d27 SR_f14_f27( lfd )
#define REST_d14_d28 SR_f14_f28( lfd )
#define REST_d14_d29 SR_f14_f29( lfd )
#define REST_d14_d30 SR_f14_f30( lfd )
#define REST_d14_d31 SR_f14_f31( lfd )



/*
 * macros common to both FPR save and restore
 */

#define SR_f14( opcode ) \
    opcode f14, (FPR_SAVE_OFF + 17*8)(sp);
#define SR_f14_f15( opcode ) \
    opcode f15, (FPR_SAVE_OFF + 16*8)(sp); \
    SR_f14( opcode )
#define SR_f14_f16( opcode ) \
    opcode f16, (FPR_SAVE_OFF + 15*8)(sp); \
    SR_f14_f15( opcode )
#define SR_f14_f17( opcode ) \
    opcode f17, (FPR_SAVE_OFF + 14*8)(sp); \
    SR_f14_f16( opcode )
#define SR_f14_f18( opcode ) \
    opcode f18, (FPR_SAVE_OFF + 13*8)(sp); \
    SR_f14_f17( opcode )
#define SR_f14_f19( opcode ) \
    opcode f19, (FPR_SAVE_OFF + 12*8)(sp); \
    SR_f14_f18( opcode )
#define SR_f14_f20( opcode ) \
    opcode f20, (FPR_SAVE_OFF + 11*8)(sp); \
    SR_f14_f19( opcode )
#define SR_f14_f21( opcode ) \
    opcode f21, (FPR_SAVE_OFF + 10*8)(sp); \
    SR_f14_f20( opcode )
#define SR_f14_f22( opcode ) \
    opcode f22, (FPR_SAVE_OFF + 9*8)(sp); \
    SR_f14_f21( opcode )
#define SR_f14_f23( opcode ) \
    opcode f23, (FPR_SAVE_OFF + 8*8)(sp); \
    SR_f14_f22( opcode )
#define SR_f14_f24( opcode ) \
    opcode f24, (FPR_SAVE_OFF + 7*8)(sp); \
    SR_f14_f23( opcode )
#define SR_f14_f25( opcode ) \
    opcode f25, (FPR_SAVE_OFF + 6*8)(sp); \
    SR_f14_f24( opcode )
#define SR_f14_f26( opcode ) \
    opcode f26, (FPR_SAVE_OFF + 5*8)(sp); \
    SR_f14_f25( opcode )
#define SR_f14_f27( opcode ) \
    opcode f27, (FPR_SAVE_OFF + 4*8)(sp); \
    SR_f14_f26( opcode )
#define SR_f14_f28( opcode ) \
    opcode f28, (FPR_SAVE_OFF + 3*8)(sp); \
    SR_f14_f27( opcode )
#define SR_f14_f29( opcode ) \
    opcode f29, (FPR_SAVE_OFF + 2*8)(sp); \
    SR_f14_f28( opcode )
#define SR_f14_f30( opcode ) \
    opcode f30, (FPR_SAVE_OFF + 1*8)(sp); \
    SR_f14_f29( opcode )
#define SR_f14_f31( opcode ) \
    opcode f31, (FPR_SAVE_OFF)(sp); \
    SR_f14_f30( opcode )
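Each SR_f14_fNN macro emits one load or store and then delegates to the macro for the next lower register, so a single invocation expands to a descending run of instructions. The C sketch below reproduces the first three links of that chain with a string-building stand-in for the assembler line. `EMIT`, `trace` and `expand_save_f14_f16` are illustrative names, and pairing SAVE_f14_f16 with stfd is an assumption made by analogy with the REST_ macros, since the corresponding definition is not visible in this part of the listing.

```c
#include <stdio.h>
#include <string.h>

#define FPR_SAVE_OFF (-(32 - 14) * 8)

static char trace[256];   /* collects the emitted "instructions" */

/* Stand-in for one assembler line: append "opcode fN,off(sp);" to trace. */
#define EMIT(opcode, reg, off) \
    snprintf(trace + strlen(trace), sizeof trace - strlen(trace), \
             "%s f%d,%d(sp);", #opcode, (reg), (off));

/* Same chaining pattern as the listing, truncated to three registers. */
#define SR_f14(opcode)     EMIT(opcode, 14, FPR_SAVE_OFF + 17 * 8)
#define SR_f14_f15(opcode) EMIT(opcode, 15, FPR_SAVE_OFF + 16 * 8) SR_f14(opcode)
#define SR_f14_f16(opcode) EMIT(opcode, 16, FPR_SAVE_OFF + 15 * 8) SR_f14_f15(opcode)

#define SAVE_f14_f16 SR_f14_f16(stfd)   /* assumed store counterpart */
#define REST_f14_f16 SR_f14_f16(lfd)

/* Expand the save chain once and return the accumulated text. */
const char *expand_save_f14_f16(void)
{
    trace[0] = '\0';
    SAVE_f14_f16
    return trace;
}
```

Passing the opcode as a macro argument is what lets one chain serve both directions: the save and restore families differ only in the mnemonic they hand to the shared SR_ macros.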

/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */

#if defined( VOLATILE_r13 )
#define SAVE_r13
#define SAVE_r13_r14 SR_r14( stw )
#define SAVE_r13_r15 SR_r14_r15( stw )
#define SAVE_r13_r16 SR_r14_r16( stw )
#define SAVE_r13_r17 SR_r14_r17( stw )
#define SAVE_r13_r18 SR_r14_r18( stw )
#define SAVE_r13_r19 SR_r14_r19( stw )
#define SAVE_r13_r20 SR_r14_r20( stw )
#define SAVE_r13_r21 SR_r14_r21( stw )
#define SAVE_r13_r22 SR_r14_r22( stw )
#define SAVE_r13_r23 SR_r14_r23( stw )
#define SAVE_r13_r24 SR_r14_r24( stw )
#define SAVE_r13_r25 SR_r14_r25( stw )
#define SAVE_r13_r26 SR_r14_r26( stw )
#define SAVE_r13_r27 SR_r14_r27( stw )
#define SAVE_r13_r28 SR_r14_r28( stw )
#define SAVE_r13_r29 SR_r14_r29( stw )
#define SAVE_r13_r30 SR_r14_r30( stw )
#define SAVE_r13_r31 SR_r14_r31( stw )
#define REST_r13
#define REST_r13_r14 SR_r14( lwz )
#define REST_r13_r15 SR_r14_r15( lwz )
#define REST_r13_r16 SR_r14_r16( lwz )
#define REST_r13_r17 SR_r14_r17( lwz )
#define REST_r13_r18 SR_r14_r18( lwz )
#define REST_r13_r19 SR_r14_r19( lwz )
#define REST_r13_r20 SR_r14_r20( lwz )
#define REST_r13_r21 SR_r14_r21( lwz )
#define REST_r13_r22 SR_r14_r22( lwz )
#define REST_r13_r23 SR_r14_r23( lwz )
#define REST_r13_r24 SR_r14_r24( lwz )
#define REST_r13_r25 SR_r14_r25( lwz )
#define REST_r13_r26 SR_r14_r26( lwz )
#define REST_r13_r27 SR_r14_r27( lwz )
#define REST_r13_r28 SR_r14_r28( lwz )
#define REST_r13_r29 SR_r14_r29( lwz )
#define REST_r13_r30 SR_r14_r30( lwz )
#define REST_r13_r31 SR_r14_r31( lwz )


#else
#define SAVE_r13     SR_r13( stw )
#define SAVE_r13_r14 SR_r13_r14( stw )
#define SAVE_r13_r15 SR_r13_r15( stw )
#define SAVE_r13_r16 SR_r13_r16( stw )
#define SAVE_r13_r17 SR_r13_r17( stw )
#define SAVE_r13_r18 SR_r13_r18( stw )
#define SAVE_r13_r19 SR_r13_r19( stw )
#define SAVE_r13_r20 SR_r13_r20( stw )
#define SAVE_r13_r21 SR_r13_r21( stw )
#define SAVE_r13_r22 SR_r13_r22( stw )
#define SAVE_r13_r23 SR_r13_r23( stw )
#define SAVE_r13_r24 SR_r13_r24( stw )
#define SAVE_r13_r25 SR_r13_r25( stw )
#define SAVE_r13_r26 SR_r13_r26( stw )
#define SAVE_r13_r27 SR_r13_r27( stw )
#define SAVE_r13_r28 SR_r13_r28( stw )
#define SAVE_r13_r29 SR_r13_r29( stw )
#define SAVE_r13_r30 SR_r13_r30( stw )
#define SAVE_r13_r31 SR_r13_r31( stw )
#define REST_r13     SR_r13( lwz )
#define REST_r13_r14 SR_r13_r14( lwz )
#define REST_r13_r15 SR_r13_r15( lwz )
#define REST_r13_r16 SR_r13_r16( lwz )
#define REST_r13_r17 SR_r13_r17( lwz )
#define REST_r13_r18 SR_r13_r18( lwz )
#define REST_r13_r19 SR_r13_r19( lwz )
#define REST_r13_r20 SR_r13_r20( lwz )
#define REST_r13_r21 SR_r13_r21( lwz )
#define REST_r13_r22 SR_r13_r22( lwz )
#define REST_r13_r23 SR_r13_r23( lwz )
#define REST_r13_r24 SR_r13_r24( lwz )
#define REST_r13_r25 SR_r13_r25( lwz )
#define REST_r13_r26 SR_r13_r26( lwz )
#define REST_r13_r27 SR_r13_r27( lwz )
#define REST_r13_r28 SR_r13_r28( lwz )
#define REST_r13_r29 SR_r13_r29( lwz )
#define REST_r13_r30 SR_r13_r30( lwz )
#define REST_r13_r31 SR_r13_r31( lwz )



/*
 * macros common to both GPR save and restore
 */

#define SR_r13( opcode ) \
    opcode r13, (GPR_SAVE_OFF + 18*4)(sp);
#define SR_r13_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp); \
    SR_r13( opcode )
#define SR_r13_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r13_r14( opcode )
#define SR_r13_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r13_r15( opcode )
#define SR_r13_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r13_r16( opcode )
#define SR_r13_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r13_r17( opcode )
#define SR_r13_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r13_r18( opcode )
#define SR_r13_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r13_r19( opcode )
#define SR_r13_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r13_r20( opcode )
#define SR_r13_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r13_r21( opcode )
#define SR_r13_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r13_r22( opcode )
#define SR_r13_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r13_r23( opcode )
#define SR_r13_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r13_r24( opcode )
#define SR_r13_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r13_r25( opcode )
#define SR_r13_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r13_r26( opcode )
#define SR_r13_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r13_r27( opcode )
#define SR_r13_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r13_r28( opcode )
#define SR_r13_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r13_r29( opcode )
#define SR_r13_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r13_r30( opcode )
#endif

#define SAVE_r14     SR_r14( stw )
#define SAVE_r14_r15 SR_r14_r15( stw )
#define SAVE_r14_r16 SR_r14_r16( stw )
#define SAVE_r14_r17 SR_r14_r17( stw )
#define SAVE_r14_r18 SR_r14_r18( stw )
#define SAVE_r14_r19 SR_r14_r19( stw )
#define SAVE_r14_r20 SR_r14_r20( stw )
#define SAVE_r14_r21 SR_r14_r21( stw )
#define SAVE_r14_r22 SR_r14_r22( stw )
#define SAVE_r14_r23 SR_r14_r23( stw )
#define SAVE_r14_r24 SR_r14_r24( stw )
#define SAVE_r14_r25 SR_r14_r25( stw )
#define SAVE_r14_r26 SR_r14_r26( stw )
#define SAVE_r14_r27 SR_r14_r27( stw )
#define SAVE_r14_r28 SR_r14_r28( stw )
#define SAVE_r14_r29 SR_r14_r29( stw )
#define SAVE_r14_r30 SR_r14_r30( stw )
#define SAVE_r14_r31 SR_r14_r31( stw )
#define REST_r14     SR_r14( lwz )
#define REST_r14_r15 SR_r14_r15( lwz )
#define REST_r14_r16 SR_r14_r16( lwz )
#define REST_r14_r17 SR_r14_r17( lwz )
#define REST_r14_r18 SR_r14_r18( lwz )
#define REST_r14_r19 SR_r14_r19( lwz )
#define REST_r14_r20 SR_r14_r20( lwz )
#define REST_r14_r21 SR_r14_r21( lwz )
#define REST_r14_r22 SR_r14_r22( lwz )
#define REST_r14_r23 SR_r14_r23( lwz )
#define REST_r14_r24 SR_r14_r24( lwz )
#define REST_r14_r25 SR_r14_r25( lwz )
#define REST_r14_r26 SR_r14_r26( lwz )
#define REST_r14_r27 SR_r14_r27( lwz )
#define REST_r14_r28 SR_r14_r28( lwz )
#define REST_r14_r29 SR_r14_r29( lwz )
#define REST_r14_r30 SR_r14_r30( lwz )
#define REST_r14_r31 SR_r14_r31( lwz )

/* end VOLATILE_r13 */

/*
 * macros common to both GPR save and restore
 */

#define SR_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp);
#define SR_r14_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r14( opcode )
#define SR_r14_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r14_r15( opcode )
#define SR_r14_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r14_r16( opcode )
#define SR_r14_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r14_r17( opcode )
#define SR_r14_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r14_r18( opcode )
#define SR_r14_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r14_r19( opcode )
#define SR_r14_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r14_r20( opcode )
#define SR_r14_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r14_r21( opcode )
#define SR_r14_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r14_r22( opcode )
#define SR_r14_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r14_r23( opcode )
#define SR_r14_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r14_r24( opcode )
#define SR_r14_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r14_r25( opcode )
#define SR_r14_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r14_r26( opcode )
#define SR_r14_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r14_r27( opcode )
#define SR_r14_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r14_r28( opcode )
#define SR_r14_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r14_r29( opcode )
#define SR_r14_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r14_r30( opcode )



/*
 * macros common to both GPR save and restore
 */

#define SR_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp);
#define SR_r15_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r15( opcode )
#define SR_r15_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r15_r16( opcode )
#define SR_r15_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r15_r17( opcode )
#define SR_r15_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r15_r18( opcode )
#define SR_r15_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r15_r19( opcode )
#define SR_r15_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r15_r20( opcode )
#define SR_r15_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r15_r21( opcode )
#define SR_r15_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r15_r22( opcode )
#define SR_r15_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r15_r23( opcode )
#define SR_r15_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r15_r24( opcode )
#define SR_r15_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r15_r25( opcode )
#define SR_r15_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r15_r26( opcode )
#define SR_r15_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r15_r27( opcode )
#define SR_r15_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r15_r28( opcode )
#define SR_r15_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r15_r29( opcode )
#define SR_r15_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r15_r30( opcode )

/*
 * macros common to both GPR save and restore
 */

#define SR_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp);
#define SR_r16_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r16( opcode )
#define SR_r16_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r16_r17( opcode )
#define SR_r16_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r16_r18( opcode )
#define SR_r16_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r16_r19( opcode )
#define SR_r16_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r16_r20( opcode )
#define SR_r16_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r16_r21( opcode )
#define SR_r16_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r16_r22( opcode )
#define SR_r16_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r16_r23( opcode )
#define SR_r16_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r16_r24( opcode )
#define SR_r16_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r16_r25( opcode )
#define SR_r16_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r16_r26( opcode )
#define SR_r16_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r16_r27( opcode )
#define SR_r16_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r16_r28( opcode )
#define SR_r16_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r16_r29( opcode )
#define SR_r16_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r16_r30( opcode )

#if defined ( BUILD_MAX ) 
/*
 * macros for saving and restoring non-volatile
 * vector registers (VRs)
 * (uses r0 as scratch register)
 */

/*
 * macros common to both VR save and restore
 * (uses r0 as scratch register)
 */

#define SR_v20( opcode ) \
    li r0, (VR_SAVE_OFF + 11*16); \
    opcode v20, sp, r0;
#define SR_v20_v21( opcode ) \
    li r0, (VR_SAVE_OFF + 10*16); \
    opcode v21, sp, r0; \
    SR_v20( opcode )
#define SR_v20_v22( opcode ) \
    li r0, (VR_SAVE_OFF + 9*16); \
    opcode v22, sp, r0; \
    SR_v20_v21( opcode )
#define SR_v20_v23( opcode ) \
    li r0, (VR_SAVE_OFF + 8*16); \
    opcode v23, sp, r0; \
    SR_v20_v22( opcode )
#define SR_v20_v24( opcode ) \
    li r0, (VR_SAVE_OFF + 7*16); \
    opcode v24, sp, r0; \
    SR_v20_v23( opcode )
#define SR_v20_v25( opcode ) \
    li r0, (VR_SAVE_OFF + 6*16); \
    opcode v25, sp, r0; \
    SR_v20_v24( opcode )
#define SR_v20_v26( opcode ) \
    li r0, (VR_SAVE_OFF + 5*16); \
    opcode v26, sp, r0; \
    SR_v20_v25( opcode )
#define SR_v20_v27( opcode ) \
    li r0, (VR_SAVE_OFF + 4*16); \
    opcode v27, sp, r0; \
    SR_v20_v26( opcode )
#define SR_v20_v28( opcode ) \
    li r0, (VR_SAVE_OFF + 3*16); \
    opcode v28, sp, r0; \
    SR_v20_v27( opcode )
#define SR_v20_v29( opcode ) \
    li r0, (VR_SAVE_OFF + 2*16); \
    opcode v29, sp, r0; \
    SR_v20_v28( opcode )
#define SR_v20_v30( opcode ) \
    li r0, (VR_SAVE_OFF + 1*16); \
    opcode v30, sp, r0; \
    SR_v20_v29( opcode )
#define SR_v20_v31( opcode ) \
    li r0, (VR_SAVE_OFF); \
    opcode v31, sp, r0; \
    SR_v20_v30( opcode )

/*
 * macros for saving, updating and restoring VRSAVE and saving and
 * restoring non-volatile vector registers (v0 - v31)
 * (destroys r0 and CR0 field of CR)
 */

#define NONVOLATILE_VR_TEST( last_vreg ) \
    andi. r0, r0, ((-1 << (31 - (last_vreg))) & 0x0fff);

#define RECORD_v0_v15( last_vreg ) \
    oris r0, r0, ((-1 << (15 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;

#define RECORD_v16_v31( last_vreg ) \
    oris r0, r0, 0xffff; \
    ori r0, r0, ((-1 << (31 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;

#define USE_v0_v15( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v0_v15( last_vreg )

#define USE_v16_v19( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v16_v31( last_vreg )

#define FREE_v0_v19( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;
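The immediates built by RECORD_v0_v15 and RECORD_v16_v31 mark every vector register from v0 through last_vreg as in use. A short C sketch of the bit arithmetic follows, assuming the usual VRSAVE convention that v0 corresponds to the most significant bit; the function names are illustrative, and unsigned constants replace the assembler's -1 because left-shifting a negative value is undefined in C.

```c
#include <stdint.h>

/* VRSAVE bit for vector register n; v0 is the most significant bit. */
static uint32_t vr_bit(int n)
{
    return 0x80000000u >> n;
}

/* Bits ORed into VRSAVE by RECORD_v0_v15( last_vreg ), last_vreg in 0..15. */
uint32_t record_v0_v15(int last_vreg)
{
    uint32_t imm = (0xffffffffu << (15 - last_vreg)) & 0xffffu;
    return imm << 16;                  /* oris places the immediate in the upper half */
}

/* Bits ORed into VRSAVE by RECORD_v16_v31( last_vreg ), last_vreg in 16..31. */
uint32_t record_v16_v31(int last_vreg)
{
    uint32_t imm = (0xffffffffu << (31 - last_vreg)) & 0xffffu;
    return 0xffff0000u | imm;          /* oris 0xffff, then ori imm */
}

/* Reference mask: bits for v0 .. last_vreg inclusive. */
uint32_t mask_through(int last_vreg)
{
    uint32_t m = 0;
    for (int n = 0; n <= last_vreg; n++)
        m |= vr_bit(n);
    return m;
}
```

The split into two macros follows from the 16-bit immediate fields of oris and ori: registers up to v15 fit in the upper halfword alone, while higher registers need the upper halfword filled and a second immediate for the lower one.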

/* 

* user-callable macros 
*/ 

#define USE_THRU_v0( cond )     USE_v0_v15( cond, 0 )
#define USE_THRU_v1( cond )     USE_v0_v15( cond, 1 )
#define USE_THRU_v2( cond )     USE_v0_v15( cond, 2 )
#define USE_THRU_v3( cond )     USE_v0_v15( cond, 3 )
#define USE_THRU_v4( cond )     USE_v0_v15( cond, 4 )

#define USE_THRU_v20( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 32 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 20 )           /* v20 in use? */ \
    beq     PC_OFFSET( 16 );            /* no, cond is set to greater than */ \
    SAVE_v20                            /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 20 )                /* indicate v0 - v20 in use */

#define USE_THRU_v21( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 40 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 21 )           /* v20 - v21 in use? */ \
    beq     PC_OFFSET( 24 );            /* no, cond is set to greater than */ \
    SAVE_v20_v21                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 21 )                /* indicate v0 - v21 in use */

#define USE_THRU_v22( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 48 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 22 )           /* v20 - v22 in use? */ \
    beq     PC_OFFSET( 32 );            /* no, cond is set to greater than */ \
    SAVE_v20_v22                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 22 )                /* indicate v0 - v22 in use */

#define USE_THRU_v23( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 56 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 23 )           /* v20 - v23 in use? */ \
    beq     PC_OFFSET( 40 );            /* no, cond is set to greater than */ \
    SAVE_v20_v23                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 23 )                /* indicate v0 - v23 in use */

#define USE_THRU_v24( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 64 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 24 )           /* v20 - v24 in use? */ \
    beq     PC_OFFSET( 48 );            /* no, cond is set to greater than */ \
    SAVE_v20_v24                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 24 )                /* indicate v0 - v24 in use */

#define USE_THRU_v25( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 72 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 25 )           /* v20 - v25 in use? */ \
    beq     PC_OFFSET( 56 );            /* no, cond is set to greater than */ \
    SAVE_v20_v25                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 25 )                /* indicate v0 - v25 in use */

#define USE_THRU_v26( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 80 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 26 )           /* v20 - v26 in use? */ \
    beq     PC_OFFSET( 64 );            /* no, cond is set to greater than */ \
    SAVE_v20_v26                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 26 )                /* indicate v0 - v26 in use */

#define USE_THRU_v27( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 88 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 27 )           /* v20 - v27 in use? */ \
    beq     PC_OFFSET( 72 );            /* no, cond is set to greater than */ \
    SAVE_v20_v27                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 27 )                /* indicate v0 - v27 in use */

#define USE_THRU_v28( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 96 );    /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 28 )           /* v20 - v28 in use? */ \
    beq     PC_OFFSET( 80 );            /* no, cond is set to greater than */ \
    SAVE_v20_v28                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 28 )                /* indicate v0 - v28 in use */

#define USE_THRU_v29( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 104 );   /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 29 )           /* v20 - v29 in use? */ \
    beq     PC_OFFSET( 88 );            /* no, cond is set to greater than */ \
    SAVE_v20_v29                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 29 )                /* indicate v0 - v29 in use */

#define USE_THRU_v30( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 112 );   /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 30 )           /* v20 - v30 in use? */ \
    beq     PC_OFFSET( 96 );            /* no, cond is set to greater than */ \
    SAVE_v20_v30                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 30 )                /* indicate v0 - v30 in use */

#define USE_THRU_v31( cond ) \
    mfspr   r0, %VRSAVE; \
    cmplwi  (cond), r0, 0; \
    beq     (cond), PC_OFFSET( 120 );   /* cond set to equal if VRSAVE = 0 */ \
    stw     r0, VRSAVE_SAVE_OFF(sp); \
    NONVOLATILE_VR_TEST( 31 )           /* v20 - v31 in use? */ \
    beq     PC_OFFSET( 104 );           /* no, cond is set to greater than */ \
    SAVE_v20_v31                        /* leaves a negative value in r0 */ \
    cmpwi   (cond), r0, 0x7fff;         /* cond is set to less than */ \
    mfspr   r0, %VRSAVE;                /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 31 )                /* indicate v0 - v31 in use */



#define FREE_THRU_v0( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v1( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v2( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v3( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v4( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v5( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v6( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v7( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v8( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v9( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v10( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v11( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v12( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v13( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v14( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v15( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v16( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v17( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v18( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v19( cond )   FREE_v0_v19( cond )

#define FREE_THRU_v20( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 20 ); \
    bgt     (cond), PC_OFFSET( 12 ); \
    REST_v20; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v21( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 28 ); \
    bgt     (cond), PC_OFFSET( 20 ); \
    REST_v20_v21; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v22( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 36 ); \
    bgt     (cond), PC_OFFSET( 28 ); \
    REST_v20_v22; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v23( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 44 ); \
    bgt     (cond), PC_OFFSET( 36 ); \
    REST_v20_v23; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v24( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 52 ); \
    bgt     (cond), PC_OFFSET( 44 ); \
    REST_v20_v24; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v25( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 60 ); \
    bgt     (cond), PC_OFFSET( 52 ); \
    REST_v20_v25; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v26( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 68 ); \
    bgt     (cond), PC_OFFSET( 60 ); \
    REST_v20_v26; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v27( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 76 ); \
    bgt     (cond), PC_OFFSET( 68 ); \
    REST_v20_v27; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v28( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 84 ); \
    bgt     (cond), PC_OFFSET( 76 ); \
    REST_v20_v28; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v29( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 92 ); \
    bgt     (cond), PC_OFFSET( 84 ); \
    REST_v20_v29; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v30( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 100 ); \
    bgt     (cond), PC_OFFSET( 92 ); \
    REST_v20_v30; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#define FREE_THRU_v31( cond ) \
    li      r0, 0; \
    beq     (cond), PC_OFFSET( 108 ); \
    bgt     (cond), PC_OFFSET( 100 ); \
    REST_v20_v31; \
    lwz     r0, VRSAVE_SAVE_OFF(sp); \
    mtspr   %VRSAVE, r0;

#endif                                  /* end BUILD_MAX */

/*
 * macros to save and restore the CR register
 * (uses r0 as scratch register)
 */
#define SAVE_CR \
    mfcr    r0; \
    stw     r0, CR_SAVE_OFF(sp);

#define REST_CR \
    lwz     r0, CR_SAVE_OFF(sp); \
    mtcr    r0;

/*
 * macros to save and restore the LR register
 * (uses r0 as scratch register)
 */
#define SAVE_LR \
    mflr    r0; \
    stw     r0, LR_SAVE_OFF(sp);

#define REST_LR \
    lwz     r0, LR_SAVE_OFF(sp); \
    mtlr    r0;

#endif                                  /* end COMPILE_C */

/*
 * macros for declaring GPR, FPR and VMX registers
 */

/*
 * declare r0
 */
#define DECLARE_r0
/*
 * r3 declare set
 */
#define DECLARE_r3
#define DECLARE_r3_r4
#define DECLARE_r3_r5
#define DECLARE_r3_r6
#define DECLARE_r3_r7
#define DECLARE_r3_r8
#define DECLARE_r3_r9
#define DECLARE_r3_r10
#define DECLARE_r3_r11
#define DECLARE_r3_r12
#define DECLARE_r3_r13
#define DECLARE_r3_r14
#define DECLARE_r3_r15
#define DECLARE_r3_r16
#define DECLARE_r3_r17
#define DECLARE_r3_r18
#define DECLARE_r3_r19
#define DECLARE_r3_r20
#define DECLARE_r3_r21
#define DECLARE_r3_r22
#define DECLARE_r3_r23
#define DECLARE_r3_r24
#define DECLARE_r3_r25
#define DECLARE_r3_r26
#define DECLARE_r3_r27
#define DECLARE_r3_r28
#define DECLARE_r3_r29
#define DECLARE_r3_r30
#define DECLARE_r3_r31



/*
 * r4 declare set
 */
#define DECLARE_r4
#define DECLARE_r4_r5
#define DECLARE_r4_r6
#define DECLARE_r4_r7
#define DECLARE_r4_r8
#define DECLARE_r4_r9
#define DECLARE_r4_r10
#define DECLARE_r4_r11
#define DECLARE_r4_r12
#define DECLARE_r4_r13
#define DECLARE_r4_r14
#define DECLARE_r4_r15
#define DECLARE_r4_r16
#define DECLARE_r4_r17
#define DECLARE_r4_r18
#define DECLARE_r4_r19
#define DECLARE_r4_r20
#define DECLARE_r4_r21
#define DECLARE_r4_r22
#define DECLARE_r4_r23
#define DECLARE_r4_r24
#define DECLARE_r4_r25
#define DECLARE_r4_r26
#define DECLARE_r4_r27
#define DECLARE_r4_r28
#define DECLARE_r4_r29
#define DECLARE_r4_r30
#define DECLARE_r4_r31

/*
 * r5 declare set
 */
#define DECLARE_r5
#define DECLARE_r5_r6
#define DECLARE_r5_r7
#define DECLARE_r5_r8
#define DECLARE_r5_r9
#define DECLARE_r5_r10
#define DECLARE_r5_r11
#define DECLARE_r5_r12
#define DECLARE_r5_r13
#define DECLARE_r5_r14
#define DECLARE_r5_r15
#define DECLARE_r5_r16
#define DECLARE_r5_r17
#define DECLARE_r5_r18
#define DECLARE_r5_r19
#define DECLARE_r5_r20
#define DECLARE_r5_r21
#define DECLARE_r5_r22
#define DECLARE_r5_r23
#define DECLARE_r5_r24
#define DECLARE_r5_r25
#define DECLARE_r5_r26
#define DECLARE_r5_r27
#define DECLARE_r5_r28
#define DECLARE_r5_r29
#define DECLARE_r5_r30
#define DECLARE_r5_r31


/*
 * r6 declare set
 */
#define DECLARE_r6
#define DECLARE_r6_r7
#define DECLARE_r6_r8
#define DECLARE_r6_r9
#define DECLARE_r6_r10
#define DECLARE_r6_r11
#define DECLARE_r6_r12
#define DECLARE_r6_r13
#define DECLARE_r6_r14
#define DECLARE_r6_r15
#define DECLARE_r6_r16
#define DECLARE_r6_r17
#define DECLARE_r6_r18
#define DECLARE_r6_r19
#define DECLARE_r6_r20
#define DECLARE_r6_r21
#define DECLARE_r6_r22
#define DECLARE_r6_r23
#define DECLARE_r6_r24
#define DECLARE_r6_r25
#define DECLARE_r6_r26
#define DECLARE_r6_r27
#define DECLARE_r6_r28
#define DECLARE_r6_r29
#define DECLARE_r6_r30
#define DECLARE_r6_r31


/*
 * r7 declare set
 */
#define DECLARE_r7
#define DECLARE_r7_r8
#define DECLARE_r7_r9
#define DECLARE_r7_r10
#define DECLARE_r7_r11
#define DECLARE_r7_r12
#define DECLARE_r7_r13
#define DECLARE_r7_r14
#define DECLARE_r7_r15
#define DECLARE_r7_r16
#define DECLARE_r7_r17
#define DECLARE_r7_r18
#define DECLARE_r7_r19
#define DECLARE_r7_r20
#define DECLARE_r7_r21
#define DECLARE_r7_r22
#define DECLARE_r7_r23
#define DECLARE_r7_r24
#define DECLARE_r7_r25
#define DECLARE_r7_r26
#define DECLARE_r7_r27
#define DECLARE_r7_r28
#define DECLARE_r7_r29
#define DECLARE_r7_r30
#define DECLARE_r7_r31


/*
 * r8 declare set
 */
#define DECLARE_r8
#define DECLARE_r8_r9
#define DECLARE_r8_r10
#define DECLARE_r8_r11
#define DECLARE_r8_r12
#define DECLARE_r8_r13
#define DECLARE_r8_r14
#define DECLARE_r8_r15
#define DECLARE_r8_r16
#define DECLARE_r8_r17
#define DECLARE_r8_r18
#define DECLARE_r8_r19
#define DECLARE_r8_r20
#define DECLARE_r8_r21
#define DECLARE_r8_r22
#define DECLARE_r8_r23
#define DECLARE_r8_r24
#define DECLARE_r8_r25
#define DECLARE_r8_r26
#define DECLARE_r8_r27
#define DECLARE_r8_r28
#define DECLARE_r8_r29
#define DECLARE_r8_r30
#define DECLARE_r8_r31


/*
 * r9 declare set
 */
#define DECLARE_r9
#define DECLARE_r9_r10
#define DECLARE_r9_r11
#define DECLARE_r9_r12
#define DECLARE_r9_r13
#define DECLARE_r9_r14
#define DECLARE_r9_r15
#define DECLARE_r9_r16
#define DECLARE_r9_r17
#define DECLARE_r9_r18
#define DECLARE_r9_r19
#define DECLARE_r9_r20
#define DECLARE_r9_r21
#define DECLARE_r9_r22
#define DECLARE_r9_r23
#define DECLARE_r9_r24
#define DECLARE_r9_r25
#define DECLARE_r9_r26
#define DECLARE_r9_r27
#define DECLARE_r9_r28
#define DECLARE_r9_r29
#define DECLARE_r9_r30
#define DECLARE_r9_r31

/*
 * r10 declare set
 */
#define DECLARE_r10
#define DECLARE_r10_r11
#define DECLARE_r10_r12
#define DECLARE_r10_r13
#define DECLARE_r10_r14
#define DECLARE_r10_r15
#define DECLARE_r10_r16
#define DECLARE_r10_r17
#define DECLARE_r10_r18
#define DECLARE_r10_r19
#define DECLARE_r10_r20
#define DECLARE_r10_r21
#define DECLARE_r10_r22
#define DECLARE_r10_r23
#define DECLARE_r10_r24
#define DECLARE_r10_r25
#define DECLARE_r10_r26
#define DECLARE_r10_r27
#define DECLARE_r10_r28
#define DECLARE_r10_r29
#define DECLARE_r10_r30
#define DECLARE_r10_r31

/*
 * r11 declare set
 */
#define DECLARE_r11
#define DECLARE_r11_r12
#define DECLARE_r11_r13
#define DECLARE_r11_r14
#define DECLARE_r11_r15
#define DECLARE_r11_r16
#define DECLARE_r11_r17
#define DECLARE_r11_r18
#define DECLARE_r11_r19
#define DECLARE_r11_r20
#define DECLARE_r11_r21
#define DECLARE_r11_r22
#define DECLARE_r11_r23
#define DECLARE_r11_r24
#define DECLARE_r11_r25
#define DECLARE_r11_r26
#define DECLARE_r11_r27
#define DECLARE_r11_r28
#define DECLARE_r11_r29
#define DECLARE_r11_r30
#define DECLARE_r11_r31

/*
 * r12 declare set
 */
#define DECLARE_r12
#define DECLARE_r12_r13
#define DECLARE_r12_r14
#define DECLARE_r12_r15
#define DECLARE_r12_r16
#define DECLARE_r12_r17
#define DECLARE_r12_r18
#define DECLARE_r12_r19
#define DECLARE_r12_r20
#define DECLARE_r12_r21
#define DECLARE_r12_r22
#define DECLARE_r12_r23
#define DECLARE_r12_r24
#define DECLARE_r12_r25
#define DECLARE_r12_r26
#define DECLARE_r12_r27
#define DECLARE_r12_r28
#define DECLARE_r12_r29
#define DECLARE_r12_r30
#define DECLARE_r12_r31

/*
 * r13 declare set
 */
#define DECLARE_r13
#define DECLARE_r13_r14
#define DECLARE_r13_r15
#define DECLARE_r13_r16
#define DECLARE_r13_r17
#define DECLARE_r13_r18
#define DECLARE_r13_r19
#define DECLARE_r13_r20
#define DECLARE_r13_r21
#define DECLARE_r13_r22
#define DECLARE_r13_r23
#define DECLARE_r13_r24
#define DECLARE_r13_r25
#define DECLARE_r13_r26
#define DECLARE_r13_r27
#define DECLARE_r13_r28
#define DECLARE_r13_r29
#define DECLARE_r13_r30
#define DECLARE_r13_r31

/*
 * r14 declare set
 */
#define DECLARE_r14
#define DECLARE_r14_r15
#define DECLARE_r14_r16
#define DECLARE_r14_r17
#define DECLARE_r14_r18
#define DECLARE_r14_r19
#define DECLARE_r14_r20
#define DECLARE_r14_r21
#define DECLARE_r14_r22
#define DECLARE_r14_r23
#define DECLARE_r14_r24
#define DECLARE_r14_r25
#define DECLARE_r14_r26
#define DECLARE_r14_r27
#define DECLARE_r14_r28
#define DECLARE_r14_r29
#define DECLARE_r14_r30
#define DECLARE_r14_r31

/*
 * r15 declare set
 */
#define DECLARE_r15
#define DECLARE_r15_r16
#define DECLARE_r15_r17
#define DECLARE_r15_r18
#define DECLARE_r15_r19
#define DECLARE_r15_r20
#define DECLARE_r15_r21
#define DECLARE_r15_r22
#define DECLARE_r15_r23
#define DECLARE_r15_r24
#define DECLARE_r15_r25
#define DECLARE_r15_r26
#define DECLARE_r15_r27
#define DECLARE_r15_r28
#define DECLARE_r15_r29
#define DECLARE_r15_r30
#define DECLARE_r15_r31

/*
 * r16 declare set
 */
#define DECLARE_r16
#define DECLARE_r16_r17
#define DECLARE_r16_r18
#define DECLARE_r16_r19
#define DECLARE_r16_r20
#define DECLARE_r16_r21
#define DECLARE_r16_r22
#define DECLARE_r16_r23
#define DECLARE_r16_r24
#define DECLARE_r16_r25
#define DECLARE_r16_r26
#define DECLARE_r16_r27
#define DECLARE_r16_r28
#define DECLARE_r16_r29
#define DECLARE_r16_r30
#define DECLARE_r16_r31

/*
 * r17 declare set
 */
#define DECLARE_r17
#define DECLARE_r17_r18
#define DECLARE_r17_r19
#define DECLARE_r17_r20
#define DECLARE_r17_r21
#define DECLARE_r17_r22
#define DECLARE_r17_r23
#define DECLARE_r17_r24
#define DECLARE_r17_r25
#define DECLARE_r17_r26
#define DECLARE_r17_r27
#define DECLARE_r17_r28
#define DECLARE_r17_r29
#define DECLARE_r17_r30
#define DECLARE_r17_r31

/*
 * r18 declare set
 */
#define DECLARE_r18
#define DECLARE_r18_r19
#define DECLARE_r18_r20
#define DECLARE_r18_r21
#define DECLARE_r18_r22
#define DECLARE_r18_r23
#define DECLARE_r18_r24
#define DECLARE_r18_r25
#define DECLARE_r18_r26
#define DECLARE_r18_r27
#define DECLARE_r18_r28
#define DECLARE_r18_r29
#define DECLARE_r18_r30
#define DECLARE_r18_r31

/*
 * r19 declare set
 */
#define DECLARE_r19
#define DECLARE_r19_r20
#define DECLARE_r19_r21
#define DECLARE_r19_r22
#define DECLARE_r19_r23
#define DECLARE_r19_r24
#define DECLARE_r19_r25
#define DECLARE_r19_r26
#define DECLARE_r19_r27
#define DECLARE_r19_r28
#define DECLARE_r19_r29
#define DECLARE_r19_r30
#define DECLARE_r19_r31



/*
 * FPR single precision declare set
 */
#define DECLARE_f0
#define DECLARE_f0_f1
#define DECLARE_f0_f2
#define DECLARE_f0_f3
#define DECLARE_f0_f4
#define DECLARE_f0_f5
#define DECLARE_f0_f6
#define DECLARE_f0_f7
#define DECLARE_f0_f8
#define DECLARE_f0_f9
#define DECLARE_f0_f10
#define DECLARE_f0_f11
#define DECLARE_f0_f12
#define DECLARE_f0_f13
#define DECLARE_f0_f14
#define DECLARE_f0_f15
#define DECLARE_f0_f16
#define DECLARE_f0_f17
#define DECLARE_f0_f18
#define DECLARE_f0_f19
#define DECLARE_f0_f20
#define DECLARE_f0_f21
#define DECLARE_f0_f22
#define DECLARE_f0_f23
#define DECLARE_f0_f24
#define DECLARE_f0_f25
#define DECLARE_f0_f26
#define DECLARE_f0_f27
#define DECLARE_f0_f28
#define DECLARE_f0_f29
#define DECLARE_f0_f30
#define DECLARE_f0_f31


/*
 * FPR double precision declare set
 */
#define DECLARE_d0
#define DECLARE_d0_d1
#define DECLARE_d0_d2
#define DECLARE_d0_d3
#define DECLARE_d0_d4
#define DECLARE_d0_d5
#define DECLARE_d0_d6
#define DECLARE_d0_d7
#define DECLARE_d0_d8
#define DECLARE_d0_d9
#define DECLARE_d0_d10
#define DECLARE_d0_d11
#define DECLARE_d0_d12
#define DECLARE_d0_d13
#define DECLARE_d0_d14
#define DECLARE_d0_d15
#define DECLARE_d0_d16
#define DECLARE_d0_d17
#define DECLARE_d0_d18
#define DECLARE_d0_d19
#define DECLARE_d0_d20
#define DECLARE_d0_d21
#define DECLARE_d0_d22
#define DECLARE_d0_d23
#define DECLARE_d0_d24
#define DECLARE_d0_d25
#define DECLARE_d0_d26
#define DECLARE_d0_d27
#define DECLARE_d0_d28
#define DECLARE_d0_d29
#define DECLARE_d0_d30
#define DECLARE_d0_d31

/*
 * VMX declare set
 */
#define DECLARE_v0
#define DECLARE_v0_v1
#define DECLARE_v0_v2
#define DECLARE_v0_v3
#define DECLARE_v0_v4
#define DECLARE_v0_v5
#define DECLARE_v0_v6
#define DECLARE_v0_v7
#define DECLARE_v0_v8
#define DECLARE_v0_v9
#define DECLARE_v0_v10
#define DECLARE_v0_v11
#define DECLARE_v0_v12
#define DECLARE_v0_v13
#define DECLARE_v0_v14
#define DECLARE_v0_v15
#define DECLARE_v0_v16
#define DECLARE_v0_v17
#define DECLARE_v0_v18
#define DECLARE_v0_v19
#define DECLARE_v0_v20
#define DECLARE_v0_v21
#define DECLARE_v0_v22
#define DECLARE_v0_v23
#define DECLARE_v0_v24
#define DECLARE_v0_v25
#define DECLARE_v0_v26
#define DECLARE_v0_v27
#define DECLARE_v0_v28
#define DECLARE_v0_v29
#define DECLARE_v0_v30
#define DECLARE_v0_v31

#endif                                  /* end SALPPC_INC */

/*
 *
 *  END OF FILE salppc.inc
 *
 */
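
The VRSAVE bookkeeping in the RECORD and NONVOLATILE_VR_TEST macros of salppc.inc comes down to two bit masks, with v0 mapped to the most significant bit of VRSAVE. The following C fragment is only an illustrative sketch of that arithmetic, not part of the listing; the function names `vrsave_mask_thru` and `nonvolatile_mask_thru` are hypothetical, and unsigned shifts are used in place of the `-1 << n` expression that the assembler evaluates.

```c
#include <assert.h>
#include <stdint.h>

/* Bits that mark v0..last_vreg as in use (v0 is the most significant bit).
 * Mirrors what RECORD_v0_v15 / RECORD_v16_v31 OR into VRSAVE.
 * last_vreg is assumed to lie in 0..31. */
static uint32_t vrsave_mask_thru(unsigned last_vreg)
{
    return 0xFFFFFFFFu << (31u - last_vreg);
}

/* Bits tested by NONVOLATILE_VR_TEST: the subset of v20..last_vreg.
 * 0x0fff selects the twelve low-order bits, i.e. v20..v31. */
static uint32_t nonvolatile_mask_thru(unsigned last_vreg)
{
    return (0xFFFFFFFFu << (31u - last_vreg)) & 0x0FFFu;
}
```

For example, `vrsave_mask_thru(20)` gives 0xFFFFF800, which is the same value RECORD_v16_v31( 20 ) builds from `oris ... 0xffff` followed by `ori ... 0xf800`.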
/*
 * MC Standard Algorithms PPC Macro language Version
 *
 * File Name:    SVE3_8BIT.MAC
 *
 * Description:  Sum the elements of 3 signed byte vectors
 *               each of length N.
 *
 *               sve3_8bit ( char *A, char *B, char *C, long *SUM, int N )
 *
 * Restrictions: A, B and C must all be 16-byte aligned.
 *               N must be a multiple of 16 and >= 16.
 *
 * Mercury Computer Systems, Inc.
 * Copyright (c) 2000 All rights reserved
 *
 * 2/23/2001
 *
 * Revision    Date      Engineer    Reason
 * 0.0         000605    fpl         Created
 */

#include "salppc.inc"

/**
    Input parameters
**/
#define A       r3
#define B       r4
#define C       r5
#define SUM     r6
#define N       r7

#define A0p     A
#define B0p     B
#define C0p     C
#define A1p     r8
#define B1p     r9
#define C1p     r10
#define index   r11

#define zero    v0
#define one     v1
#define a0      v2
#define a1      v3
#define b0      v4
#define b1      v5
#define c0      v6
#define c1      v7
#define sum0    v8
#define sum1    v9
#define sum2    v10



FUNC_PROLOG
ENTRY_5( sve3_8bit, A, B, C, SUM, N )
    USE_THRU_v10( VRSAVE_COND )

    LI( index, 0 )
    VXOR( zero, zero, zero )
    ADDIC_C( N, N, -32 )
    LVX( a0, A0p, index )
    VSPLTISB( one, 1 )
    LVX( b0, B0p, index )
    ADDI( A1p, A0p, 16 )
    VXOR( sum0, sum0, sum0 )
    ADDI( B1p, B0p, 16 )
    VXOR( sum1, sum1, sum1 )
    ADDI( C1p, C0p, 16 )
    VXOR( sum2, sum2, sum2 )
    BLT( do16 )

LABEL( loop )
    ADDIC_C( N, N, -32 )
    LVX( c0, C0p, index )
    VMSUMMBM( sum0, a0, one, sum0 )
    LVX( a1, A1p, index )
    VMSUMMBM( sum1, b0, one, sum1 )
    LVX( b1, B1p, index )
    VMSUMMBM( sum2, c0, one, sum2 )
    LVX( c1, C1p, index )
    ADDI( index, index, 32 )
    VMSUMMBM( sum0, a1, one, sum0 )
    LVX( a0, A0p, index )
    VMSUMMBM( sum1, b1, one, sum1 )
    LVX( b0, B0p, index )
    VMSUMMBM( sum2, c1, one, sum2 )
    BGE( loop )

    CMPWI( N, -32 )
    BEQ( combine )

LABEL( do16 )
    LVX( c0, C0p, index )
    VMSUMMBM( sum0, a0, one, sum0 )
    VMSUMMBM( sum1, b0, one, sum1 )
    VMSUMMBM( sum2, c0, one, sum2 )

LABEL( combine )
    VADDUWM( sum0, sum0, sum1 )
    VADDUWM( sum0, sum0, sum2 )
    VSUMSWS( sum0, sum0, zero )
    VSPLTW( sum0, sum0, 3 )
    STVEWX( sum0, 0, SUM )

    FREE_THRU_v10( VRSAVE_COND )
    RETURN
FUNC_EPILOG
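
As a cross-check on what the vector kernel above computes, a scalar reference can be sketched in C. This is an assumption-laden illustration rather than code from the listing: the name `sve3_8bit_ref` is hypothetical, `int32_t` stands in for the 32-bit `long` of the original prototype, and the alignment and multiple-of-16 restrictions of the AltiVec version are not needed here. It keeps three partial sums, one per input vector, and combines them at the end, as the kernel does with sum0, sum1 and sum2.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: *sum = sum(a[0..n-1]) + sum(b[0..n-1]) + sum(c[0..n-1]),
 * with every element treated as a signed byte. */
static void sve3_8bit_ref(const int8_t *a, const int8_t *b, const int8_t *c,
                          int32_t *sum, int n)
{
    int32_t s0 = 0, s1 = 0, s2 = 0;   /* one accumulator per input vector */
    for (int i = 0; i < n; ++i) {
        s0 += a[i];
        s1 += b[i];
        s2 += c[i];
    }
    *sum = s0 + s1 + s2;              /* the "combine" step */
}
```

In the vector version each VMSUMMBM against a vector of ones plays the role of one group of these additions, folding sixteen signed bytes into four word-sized partial sums per instruction.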
--***************************************************************
--***************************************************************
--**
--** Majority Voter/Sync Control logic TOP LEVEL Module: voter_sync.vhd
--**
--** Description: This Module is the top level of the
--**              Majority Voter and Raceway Sync Logic
--**
--** Author : Steven Imperiali
--** Date   : 7-05-2000
--** Date   : 10-25-2000 Modified cable clock and sync
--**
--***************************************************************
-- This PLD handles the following functions:
--   1) Raceway clock source and skew control
--   2) Raceway sync generation
--   3) Majority voter logic
--   4) I2C reset logic
--   5) Inverter for the HS LED signal

LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
USE STD.TEXTIO.ALL;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;

ENTITY voter_sync IS
    PORT(
        clk_66_pal6         :IN    std_logic;
        clk_33_pall         :IN    std_logic;
        reset_0             :IN    std_logic;
        x_rst_brd_0         :OUT   std_logic;
        x_rst_brd_1         :OUT   std_logic;
        pll_rng_sel         :OUT   std_logic;
        pll_freq_sel        :OUT   std_logic;
        fb_sk_sel           :OUT   std_logic;
        fb_dev_by_2_0       :OUT   std_logic;
        main_sk_sel0        :OUT   std_logic;
        main_sk_sel1        :OUT   std_logic;
        jk_sk_sel0          :OUT   std_logic;
        jk_sk_sel1          :OUT   std_logic;
        jx1_clk_oe          :OUT   std_logic;
        jx2_clk_oe          :OUT   std_logic;
        sw_clk_mode2_1      :IN    std_logic_vector (2 downto 1);
        mux_clk_sel0        :OUT   std_logic;
        mux_clk_sel1        :OUT   std_logic;

        testn               :IN    std_logic;
        tms0                :IN    std_logic;
        rsync_x_nd0         :OUT   std_logic;
        rsync_x_nd1         :OUT   std_logic;
        rsync_x_nd2         :OUT   std_logic;
        rsync_x_nd3         :OUT   std_logic;
        rsync_x_pxb0        :OUT   std_logic;
        rsync_x_xbar        :OUT   std_logic;

        nd0_resetreq_0      :IN    std_logic;
        nd1_resetreq_0      :IN    std_logic;
        nd2_resetreq_0      :IN    std_logic;
        nd3_resetreq_0      :IN    std_logic;
        pq_resetreq_0       :IN    std_logic;
        resetvote_0         :OUT   std_logic;

        nd0_ckstpreqnd0_0   :IN    std_logic;






WO 02/073937 



PCT/US02/08106 



voter_sync . vhd 



3/9/2001 



        nd0_ckstpreqnd1_0   :IN    std_logic;
        nd0_ckstpreqnd2_0   :IN    std_logic;
        nd0_ckstpreqnd3_0   :IN    std_logic;
        nd0_ckstpreqpq_0    :IN    std_logic;
        nd1_ckstpreqnd0_0   :IN    std_logic;
        nd1_ckstpreqnd1_0   :IN    std_logic;
        nd1_ckstpreqnd2_0   :IN    std_logic;
        nd1_ckstpreqnd3_0   :IN    std_logic;
        nd1_ckstpreqpq_0    :IN    std_logic;
        nd2_ckstpreqnd0_0   :IN    std_logic;
        nd2_ckstpreqnd1_0   :IN    std_logic;
        nd2_ckstpreqnd2_0   :IN    std_logic;
        nd2_ckstpreqnd3_0   :IN    std_logic;
        nd2_ckstpreqpq_0    :IN    std_logic;
        nd3_ckstpreqnd0_0   :IN    std_logic;
        nd3_ckstpreqnd1_0   :IN    std_logic;
        nd3_ckstpreqnd2_0   :IN    std_logic;
        nd3_ckstpreqnd3_0   :IN    std_logic;
        nd3_ckstpreqpq_0    :IN    std_logic;
        pq_ckstpreqnd0_0    :IN    std_logic;
        pq_ckstpreqnd1_0    :IN    std_logic;
        pq_ckstpreqnd2_0    :IN    std_logic;
        pq_ckstpreqnd3_0    :IN    std_logic;
        pq_ckstpreqpq_0     :IN    std_logic;
        pq_ckstopin_0       :OUT   std_logic;
        nd0_ckstopin_0      :OUT   std_logic;
        nd1_ckstopin_0      :OUT   std_logic;
        nd2_ckstopin_0      :OUT   std_logic;
        nd3_ckstopin_0      :OUT   std_logic;

        i2c_rst_0           :IN    std_logic;
        sda                 :INOUT std_logic;
        scl                 :INOUT std_logic;

        pxb0_hs_led         :IN    std_logic;
        hs_led              :OUT   std_logic
    );

END voter_sync;

ARCHITECTURE TOP_LEVEL_voter_sync OF voter_sync IS

--***************************************************************
--***************************************************************
--** Component Declaration
--***************************************************************
--***************************************************************




COMPONENT m_voter PORT(
    clk_66_pal6 : IN  std_logic;
    reset_0     : IN  std_logic;
    request0_0  : IN  std_logic;
    request1_0  : IN  std_logic;
    request2_0  : IN  std_logic;
    request3_0  : IN  std_logic;
    request4_0  : IN  std_logic;
    healthy0_1  : IN  std_logic;
    healthy1_1  : IN  std_logic;
    healthy2_1  : IN  std_logic;
    healthy3_1  : IN  std_logic;
    healthy4_1  : IN  std_logic;
    voteout_0   : OUT std_logic);
END COMPONENT;



--***************************************************************
--** Signals to Connect All of the Components Together
--***************************************************************

Signal healthy0_1 : std_logic;
Signal healthy1_1 : std_logic;
Signal healthy2_1 : std_logic;
Signal healthy3_1 : std_logic;
Signal healthy4_1 : std_logic;
Signal sync_d1    : std_logic;
Signal sync_d2    : std_logic;
Signal sync_d3    : std_logic;
Signal nd0_ckstop_0, nd1_ckstop_0, nd2_ckstop_0, nd3_ckstop_0, pq_ckstop_0 : std_logic;
Signal g_nd0_resetreq_0 : std_logic;
Signal g_nd1_resetreq_0 : std_logic;
Signal g_nd2_resetreq_0 : std_logic;
Signal g_nd3_resetreq_0 : std_logic;



BEGIN 



--** Begin Architecture Here (Instantiations)
--***************************************************************

nd0_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd0_0,
    nd1_ckstpreqnd0_0,
    nd2_ckstpreqnd0_0,
    nd3_ckstpreqnd0_0,
    pq_ckstpreqnd0_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd0_ckstop_0);

nd1_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd1_0,
    nd1_ckstpreqnd1_0,
    nd2_ckstpreqnd1_0,
    nd3_ckstpreqnd1_0,
    pq_ckstpreqnd1_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd1_ckstop_0);

nd2_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd2_0,
    nd1_ckstpreqnd2_0,
    nd2_ckstpreqnd2_0,
    nd3_ckstpreqnd2_0,
    pq_ckstpreqnd2_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd2_ckstop_0);

nd3_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqnd3_0,
    nd1_ckstpreqnd3_0,
    nd2_ckstpreqnd3_0,
    nd3_ckstpreqnd3_0,
    pq_ckstpreqnd3_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    nd3_ckstop_0);

pq_ckstop_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    nd0_ckstpreqpq_0,
    nd1_ckstpreqpq_0,
    nd2_ckstpreqpq_0,
    nd3_ckstpreqpq_0,
    pq_ckstpreqpq_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    pq_ckstop_0);



--#####################################################
-- this section was added to force a board level reset when
-- the 8240 has a watchdog failure.
--
-- this should have been done by feeding the 8240's WDFAIL
-- to the reset PLD instead of forcing the 8240's resetreq
-- to drive all other resetrequests.

g_nd0_resetreq_0 <= nd0_resetreq_0 AND pq_resetreq_0;
g_nd1_resetreq_0 <= nd1_resetreq_0 AND pq_resetreq_0;
g_nd2_resetreq_0 <= nd2_resetreq_0 AND pq_resetreq_0;
g_nd3_resetreq_0 <= nd3_resetreq_0 AND pq_resetreq_0;

--#####################################################



reset_req_voter : m_voter PORT Map(
    clk_66_pal6,
    reset_0,
    g_nd0_resetreq_0,
    g_nd1_resetreq_0,
    g_nd2_resetreq_0,
    g_nd3_resetreq_0,
    pq_resetreq_0,
    healthy0_1,
    healthy1_1,
    healthy2_1,
    healthy3_1,
    healthy4_1,
    resetvote_0);

healthy0_1 <= nd0_ckstop_0;
healthy1_1 <= nd1_ckstop_0;
healthy2_1 <= nd2_ckstop_0;
healthy3_1 <= nd3_ckstop_0;
healthy4_1 <= pq_ckstop_0;



nd0_ckstopin_0 <= nd0_ckstop_0;
nd1_ckstopin_0 <= nd1_ckstop_0;
nd2_ckstopin_0 <= nd2_ckstop_0;
nd3_ckstopin_0 <= nd3_ckstop_0;
pq_ckstopin_0  <= pq_ckstop_0;



WITH i2c_rst_0 SELECT
    sda <= clk_33_pal1 WHEN '0',
           'Z'         WHEN '1',
           'Z'         WHEN OTHERS;

WITH i2c_rst_0 SELECT
    scl <= clk_33_pal1 WHEN '0',
           'Z'         WHEN '1',
           'Z'         WHEN OTHERS;

hs_led <= NOT(pxb0_hs_led);



-- Sync Control
process (clk_66_pal6, reset_0)
BEGIN
    IF (reset_0 = '0') THEN
        sync_d1      <= '1';
        sync_d2      <= '1';
        sync_d3      <= '1';
        rsync_x_nd0  <= '0';
        rsync_x_nd1  <= '0';
        rsync_x_nd2  <= '0';
        rsync_x_nd3  <= '0';
        rsync_x_pxb0 <= '0';
        rsync_x_xbar <= '0';
    ELSIF (testn = '0' AND reset_0 = '1') THEN
        rsync_x_nd0  <= tms0;
        rsync_x_nd1  <= tms0;
        rsync_x_nd2  <= tms0;
        rsync_x_nd3  <= tms0;
        rsync_x_pxb0 <= '0';
        rsync_x_xbar <= '0';
    ELSIF rising_edge(clk_66_pal6) THEN
        sync_d1 <= NOT(sync_d1);
        sync_d2 <= ((NOT(sync_d2) AND sync_d1) OR (sync_d2 AND NOT(sync_d1)));
        sync_d3 <= (NOT(NOT(sync_d1) AND sync_d2));

        rsync_x_nd0  <= sync_d3;
        rsync_x_nd1  <= sync_d3;
        rsync_x_nd2  <= sync_d3;
        rsync_x_nd3  <= sync_d3;
        rsync_x_pxb0 <= sync_d3;
        rsync_x_xbar <= sync_d3;
    END IF;
END process;



x_rst_brd_0 <= reset_0;
x_rst_brd_1 <= NOT(reset_0);



WITH sw_clk_mode2_1 SELECT
    mux_clk_sel0 <= '0' WHEN "00",   -- 66MHz local
                    '0' WHEN "01",   -- 33MHz cable 1
                    '1' WHEN "10",   -- 33MHz cable 2
                    '0' WHEN "11",   -- 66 MHz local
                    '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    mux_clk_sel1 <= '0' WHEN "00",
                    '1' WHEN "01",
                    '1' WHEN "10",
                    '0' WHEN "11",
                    '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    fb_dev_by_2_0 <= '0' WHEN "00",
                     'Z' WHEN "01",
                     'Z' WHEN "10",
                     '0' WHEN "11",
                     '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    jx1_clk_oe <= '1' WHEN "00",
                  '1' WHEN "01",
                  '1' WHEN "10",
                  '1' WHEN "11",
                  '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    jx2_clk_oe <= '1' WHEN "00",
                  '1' WHEN "01",
                  '1' WHEN "10",
                  '1' WHEN "11",
                      WHEN OTHERS;



WITH sw_clk_mode2_1 SELECT
    pll_rng_sel <= '1' WHEN "00",
                   '1' WHEN "01",
                   '1' WHEN "10",
                   '1' WHEN "11",
                   '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    pll_freq_sel <= 'Z' WHEN "00",
                    '0' WHEN "01",
                    '0' WHEN "10",
                    'Z' WHEN "11",
                    '1' WHEN OTHERS;



-- select 0 skew for all modes

WITH sw_clk_mode2_1 SELECT
                        WHEN "00",
                    'Z' WHEN "01",
                    'Z' WHEN "10",
                    'Z' WHEN "11",
                    '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    main_sk_sel0 <= 'Z' WHEN "00",
                    'Z' WHEN "01",
                        WHEN "10",
                    'Z' WHEN "11",
                        WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    main_sk_sel1 <= 'Z' WHEN "00",
                    'Z' WHEN "01",
                        WHEN "10",
                    'Z' WHEN "11",
                        WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
    jk_sk_sel0 <= 'Z' WHEN "00",
                  'Z' WHEN "01",
                  'Z' WHEN "10",
                  'Z' WHEN "11",
                  '1' WHEN OTHERS;

WITH sw_clk_mode2_1 SELECT
                  'Z' WHEN "00",
                  'Z' WHEN "01",
                  'Z' WHEN "10",
                  'Z' WHEN "11",
                  '1' WHEN OTHERS;





END TOP_LEVEL_voter_sync ; 
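
The Sync Control process in the listing above can be checked with a small software model. The sketch below is only an illustration in Python, not part of the original design: the function name `sync_divider` is hypothetical, and it models only the three registers `sync_d1`, `sync_d2` and `sync_d3`, starting from the values the process forces while `reset_0` is low. Under those assumptions `sync_d3` goes low for one clock in every four, which is the pulse the `rsync_x_*` outputs then register.

```python
def sync_divider(n_edges):
    """Model sync_d1/sync_d2/sync_d3 for n_edges rising clock edges.

    Returns the value of sync_d3 after each edge. Start values are the
    ones the VHDL process assigns while reset_0 = '0'.
    """
    d1, d2, d3 = 1, 1, 1
    trace = []
    for _ in range(n_edges):
        # Registered logic: every next value uses the pre-edge state.
        next_d1 = 1 - d1                    # NOT(sync_d1)
        next_d2 = d1 ^ d2                   # (NOT d2 AND d1) OR (d2 AND NOT d1)
        next_d3 = 1 - ((1 - d1) & d2)       # NOT(NOT(sync_d1) AND sync_d2)
        d1, d2, d3 = next_d1, next_d2, next_d3
        trace.append(d3)
    return trace


if __name__ == "__main__":
    print(sync_divider(8))
```

Because `sync_d1` and `sync_d2` together form a two-bit counter, the decode on `sync_d3` repeats with a period of four clocks regardless of how long the model is run.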
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MC Standard Algorithms PPC Macro language Version 



File Name:     ZDOTPR4_VMX.K

Description:   CPP Source code for Vector Single Precision
               Split Complex Dot Product given that input
               vectors are relatively unaligned.

Entry/params:  ZDOTPR4_VMX (A, I, B, J, C, N)
               ZIDOTPR4_VMX (A, I, B, J, C, N)

Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ]
                           -/+ A->imagp[mI]*B->imagp[mJ])
               C[1] = sum (A->realp[mI]*B->imagp[mJ]
                           +/- A->imagp[mI]*B->realp[mJ])
               for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason

0.0        000608   fpl        Created (from zdotpr_vmx.k)

+

#include "salppc.inc"
/**
  ESAL CPP definitions
**/
#undef FUNC_ENTRY
#undef LOAD_A
#undef LOAD_B
#undef SUFFIX

#if defined( VMX_SAL )
#define FUNC_ENTRY _zdotpr4_vmx
#define FUNC_CONJ_ENTRY _zidotpr4_vmx
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label

#elif defined( VMX_NN )
#define FUNC_ENTRY _zdotpr4_vmx_nn
#define FUNC_CONJ_ENTRY _zidotpr4_vmx_nn
#define LOAD_A( vT, rA, rB ) LVXL( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVXL( vT, rA, rB )
#define SUFFIX( label ) label##_nn

#elif defined( VMX_NC )
#define FUNC_ENTRY _zdotpr4_vmx_nc
#define FUNC_CONJ_ENTRY _zidotpr4_vmx_nc
#define LOAD_A( vT, rA, rB ) LVXL( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_nc

#elif defined( VMX_CN )
#define FUNC_ENTRY _zdotpr4_vmx_cn
#define FUNC_CONJ_ENTRY _zidotpr4_vmx_cn
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVXL( vT, rA, rB )
#define SUFFIX( label ) label##_cn

#elif defined( VMX_CC )
#define FUNC_ENTRY _zdotpr4_vmx_cc
#define FUNC_CONJ_ENTRY _zidotpr4_vmx_cc
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_cc

#else
#error YOU MUST DEFINE VMX_xxx, where x = C or N
#endif

#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */

/**
  Local CPP definitions
**/
#define NMASK2 0x8
#define NMASK1 0x4
#define NSHIFT 4
#define ADDRESS_INCREMENT 16

/**
  Input args
**/
#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
  Split complex parameters
**/
#define Ar0 A
#define Ai0 r10
#define Br0 B
#define Bi0 r11
#define Cr C
#define Ci r12

/**
  Local registers
**/
#define count r4
#define rtmp0 r4
#define rtmp1 r13
#define Ar1 r13
#define Ai1 r14
#define Ar2 r15
#define Ai2 r16
#define Ar3 r17
#define Ai3 r18
#define Br1 r19
#define Bi1 r20
#define Br2 r21
#define Bi2 r22
#define Br3 r23
#define Bi3 r24
#define aoffset r25
#define coffset r25
#define boffset r26
#define addr_incr r27

/**
  VMX registers
**/
#define rsumr v0
#define rsumi v1
#define isumr v2
#define isumi v3
#define rsum0 v4
#define rsum1 v5
#define isum0 v6
#define isum1 v7

#define ar0 v4
#define ai0 v5
#define ar1 v6
#define ai1 v7
#define ar2 v8
#define ai2 v9
#define ar3 v10
#define ai3 v11

#define br0 v12
#define bi0 v13
#define br1 v14
#define bi1 v15
#define br2 v16
#define bi2 v17
#define br3 v18
#define bi3 v19
#define apC v20

#define atr0 v21
#define ati0 v22
#define atr1 v23
#define ati1 v24
#define atr2 v25
#define ati2 v26
#define atr3 v27
#define ati3 v28

/**
  FPU registers
**/
#define far f0
#define fbr f1
#define fai f2
#define fbi f3
#define frsumr f4
#define frsumi f5
#define fisumi f6
#define fisumr f7
#define frsum f8
#define fisum f9
#define rsum_vmx f10
#define isum_vmx f11

/**
  Begin code text, Save some registers
  Here for conjugate inner product
**/
U_ENTRY( FUNC_CONJ_ENTRY )
MR( rtmp0, Cr )
MR( Cr, Ci )
MR( Ci, rtmp0 )
MR( rtmp0, Br0 )
MR( Br0, Bi0 )
MR( Bi0, rtmp0 )
/**
  Here for normal inner product
**/
FUNC_PROLOG
U_ENTRY( FUNC_ENTRY )
DECLARE_f0_f11
DECLARE_r3_r27
DECLARE_v0_v28
/**
  Initial setup code
**/
SAVE_r13_r27
USE_THRU_v28( VREGSAVE_COND )
LFS( frsumr, Ar0, 0 )
FSUBS( frsumr, frsumr, frsumr )
FMR( frsumi, frsumr )
FMR( fisumr, frsumr )
FMR( fisumi, frsumr )
FMR( rsum_vmx, frsumr )
FMR( isum_vmx, frsumr )

/**
  Process unaligned vector section first
**/
LABEL( SUFFIX( cont ) )
GET_VMX_UNALIGNED_COUNT( count, Br0 )
LI( aoffset, 0 )
LI( boffset, 0 )
BEQ( SUFFIX( aligned ) )
SUB( N, N, count ) /* adjust N for after loop */
/**
  Here to do first 1 to 3 points using standard FP
  Store result for later post_loop processing
**/
LFSX( far, Ar0, aoffset )
LFSX( fai, Ai0, aoffset )
DECR_C( count )
LFSX( fbr, Br0, boffset )
LFSX( fbi, Bi0, boffset )
FMULS( frsumr, far, fbr )
FMULS( frsumi, fai, fbi )
FMULS( fisumi, far, fbi )
FMULS( fisumr, fai, fbr )
ADDI( Ar0, Ar0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Bi0, Bi0, 4 )
BEQ( SUFFIX( aligned ) )
/**
  Loop does 1 or 2 more sum updates
**/
LABEL( SUFFIX( pre_loop ) )
LFSX( far, Ar0, aoffset )
LFSX( fai, Ai0, aoffset )
DECR_C( count )
LFSX( fbr, Br0, boffset )
LFSX( fbi, Bi0, boffset )
FMADDS( frsumr, far, fbr, frsumr )
ADDI( Ar0, Ar0, 4 )
FMADDS( frsumi, fai, fbi, frsumi )
ADDI( Ai0, Ai0, 4 )
FMADDS( fisumi, far, fbi, fisumi )
ADDI( Br0, Br0, 4 )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( pre_loop ) )

/**
  Here for VMX aligned loop code
  Prepare for loop entry: assign loop pointers, counters
**/
LABEL( SUFFIX( aligned ) )
SRWI_C( count, N, 4 ) /* 16 per trip */
LVSL( apC, Ar0, aoffset )
LI( aoffset, 0 )
LI( boffset, 0 )

ADDI( Ar1, Ar0, 16 )
VXOR( rsumr, rsumr, rsumr )
ADDI( Ar2, Ar0, 32 )
ADDI( Ar3, Ar0, 48 )

ADDI( Ai1, Ai0, 16 )
VXOR( isumi, isumi, isumi )
ADDI( Ai2, Ai0, 32 )
ADDI( Ai3, Ai0, 48 )

ADDI( Br1, Br0, 16 )
VXOR( rsumi, rsumi, rsumi )
ADDI( Br2, Br0, 32 )
ADDI( Br3, Br0, 48 )

ADDI( Bi1, Bi0, 16 )
ADDI( Bi2, Bi0, 32 )
VXOR( isumr, isumr, isumr )
ADDI( Bi3, Bi0, 48 )
BEQ( SUFFIX( two_left ) )

/**
  Loop windin section
**/
LOAD_A( atr0, Ar0, aoffset )
LOAD_A( ati0, Ai0, aoffset )
LOAD_A( atr1, Ar1, aoffset )
LOAD_A( ati1, Ai1, aoffset )

LOAD_A( atr2, Ar2, aoffset )
LOAD_A( ati2, Ai2, aoffset )
VPERM( ar0, atr0, atr1, apC )
LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
DECR_C( count )
VPERM( ai0, ati0, ati1, apC )
LOAD_B( br1, Br1, boffset )
VPERM( ar1, atr1, atr2, apC )
LOAD_A( atr3, Ar3, aoffset )
BR( SUFFIX( mid_loop ) )

/**
  Top of vector loop
**/
LABEL( SUFFIX( loop ) )
/* { */
LOAD_A( atr2, Ar2, aoffset )
VMADDFP( rsumr, ar3, br3, rsumr )
LOAD_A( ati2, Ai2, aoffset )
VPERM( ar0, atr0, atr1, apC ) /* uses last pass value */
VMADDFP( rsumi, ai3, bi3, rsumi )
LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
DECR_C( count )
VPERM( ai0, ati0, ati1, apC )
LOAD_B( br1, Br1, boffset )
VPERM( ar1, atr1, atr2, apC )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( atr3, Ar3, aoffset )
VMADDFP( isumr, ai3, br3, isumr )
/**
  Loop entry
**/
LABEL( SUFFIX( mid_loop ) )
VMADDFP( rsumr, ar0, br0, rsumr )
VPERM( ai1, ati1, ati2, apC )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ati3, Ai3, aoffset )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( bi1, Bi1, boffset )
ADDI( aoffset, aoffset, 64 )
VPERM( ar2, atr2, atr3, apC )
VMADDFP( isumi, ar0, bi0, isumi )
LOAD_B( br2, Br2, boffset )
VMADDFP( rsumr, ar1, br1, rsumr )
LOAD_B( bi2, Bi2, boffset )
VMADDFP( isumr, ai1, br1, isumr )
/**
  Loop exit
**/
VPERM( ai2, ati2, ati3, apC )
BEQ( SUFFIX( loop_exit ) )
LOAD_A( atr0, Ar0, aoffset )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ati0, Ai0, aoffset )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( br3, Br3, boffset )
VPERM( ar3, atr3, atr0, apC )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( atr1, Ar1, aoffset )
VMADDFP( rsumi, ai2, bi2, rsumi )
VPERM( ai3, ati3, ati0, apC )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, boffset )
ADDI( boffset, boffset, 64 )
LOAD_A( ati1, Ai1, aoffset )
VMADDFP( isumr, ai2, br2, isumr )
/* } */
BR( SUFFIX( loop ) )

/**
  windout section
**/
LABEL( SUFFIX( loop_exit ) )
LOAD_A( atr0, Ar0, aoffset )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ati0, Ai0, aoffset )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( br3, Br3, boffset )
VPERM( ar3, atr3, atr0, apC )
VMADDFP( rsumr, ar2, br2, rsumr )
VMADDFP( rsumi, ai2, bi2, rsumi )
VPERM( ai3, ati3, ati0, apC )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, boffset )
ADDI( boffset, boffset, 64 )
VMADDFP( isumr, ai2, br2, isumr )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
VMADDFP( isumi, ar3, bi3, isumi )
VMADDFP( isumr, ai3, br3, isumr )

/**
  Remaining sum updates
**/
LABEL( SUFFIX( two_left ) )
ANDI_C( count, N, 0x8 ) /* bit 3 */
BEQ( SUFFIX( one_left ) )

LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
LOAD_B( br1, Br1, boffset )
LOAD_B( bi1, Bi1, boffset )
ADDI( boffset, boffset, 32 )

LOAD_A( atr0, Ar0, aoffset )
LOAD_A( ati0, Ai0, aoffset )
LOAD_A( atr1, Ar1, aoffset )
LOAD_A( ati1, Ai1, aoffset )
LOAD_A( atr2, Ar2, aoffset )
LOAD_A( ati2, Ai2, aoffset )
ADDI( aoffset, aoffset, 32 )

VPERM( ar0, atr0, atr1, apC ) /* uses last pass value */
VPERM( ai0, ati0, ati1, apC )
VPERM( ar1, atr1, atr2, apC )
VPERM( ai1, ati1, ati2, apC )
VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumr, ai0, br0, isumr )
VMADDFP( isumi, ar0, bi0, isumi )

VMADDFP( rsumr, ar1, br1, rsumr )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumi, ai1, bi1, rsumi )
VMADDFP( isumi, ar1, bi1, isumi )
VMR( atr3, atr1 )
VMR( ati3, ati1 )

LABEL( SUFFIX( one_left ) )
ANDI_C( count, N, 0x4 ) /* bit 2 */
BEQ( SUFFIX( combine ) )

LOAD_B( br0, Br0, boffset )
LOAD_B( bi0, Bi0, boffset )
ADDI( boffset, boffset, 16 )

LOAD_A( atr0, Ar0, aoffset )
LOAD_A( ati0, Ai0, aoffset )
LOAD_A( atr1, Ar1, aoffset )
LOAD_A( ati1, Ai1, aoffset )
ADDI( aoffset, aoffset, 16 )

VPERM( ar0, atr0, atr1, apC ) /* uses last pass value */
VPERM( ai0, ati0, ati1, apC )

VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumr, ai0, br0, isumr )
VMADDFP( isumi, ar0, bi0, isumi )

/**
  combine partial sums, permute, write out results
**/
LABEL( SUFFIX( combine ) )
VSUBFP( rsumr, rsumr, rsumi ) /* rsumr = rsumr - rsumi */
VADDFP( isumi, isumi, isumr )
/**
  8 bytes/cycle shuffle:
  real/imag logic should be intermixed for efficiency
**/
VMRGHW( rsum0, rsumr, rsumr )
ANDI_C( addr_incr, N, 0x3 )
VMRGHW( isum0, isumi, isumi )
VMRGLW( rsum1, rsumr, rsumr )
SUB( addr_incr, N, addr_incr ) /* offset index for remainders */
VMRGLW( isum1, isumi, isumi )
VADDFP( rsum0, rsum1, rsum0 )
SLWI( addr_incr, addr_incr, 2 ) /* byte offset */
VADDFP( isum0, isum1, isum0 )
VMRGHW( rsum1, rsum0, rsum0 )
ADD( Ar0, Ar0, addr_incr )
VMRGHW( isum1, isum0, isum0 )
ADD( Ai0, Ai0, addr_incr )
VMRGLW( rsum0, rsum0, rsum0 )
ADD( Br0, Br0, addr_incr )
VMRGLW( isum0, isum0, isum0 )
ADD( Bi0, Bi0, addr_incr )
VADDFP( rsumr, rsum1, rsum0 )
LI( coffset, 0 ) /* needed for output */
VADDFP( isumi, isum1, isum0 )
/**
  4 byte stores
**/
STVEWX( rsumr, Cr, coffset )
STVEWX( isumi, Ci, coffset )
/**
  Remainders of 1-3 more to do
**/
ANDI_C( N, N, 3 )
LFS( rsum_vmx, Cr, 0 )
LFS( isum_vmx, Ci, 0 )
BEQ( SUFFIX( scaler_vmx_combine ) )
/**
  Here to do last 1-3 points using standard FP
**/
LABEL( SUFFIX( post_loop ) )
LFS( far, Ar0, 0 )
LFS( fai, Ai0, 0 )
DECR_C( N )
LFS( fbr, Br0, 0 )
LFS( fbi, Bi0, 0 )
FMADDS( frsumr, far, fbr, frsumr )
FMADDS( frsumi, fai, fbi, frsumi )
FMADDS( fisumi, far, fbi, fisumi )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Ar0, Ar0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( post_loop ) )

/**
  Write out result
**/
LABEL( SUFFIX( scaler_vmx_combine ) )
FSUBS( frsum, frsumr, frsumi ) /* rsumr = rsumr - rsumi */
FADDS( fisum, fisumi, fisumr )
FADDS( frsum, frsum, rsum_vmx )
FADDS( fisum, fisum, isum_vmx )
STFS( frsum, Cr, 0 )
STFS( fisum, Ci, 0 )

/**
  return
**/
LABEL( SUFFIX( ret ) )
FREE_THRU_v28( VREGSAVE_COND )
REST_r13_r27
RETURN
FUNC_EPILOG
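
The ZDOTPR4_VMX.K listing above splits the N points into a scalar head, 16-point vector trips, optional 8-point and 4-point vector steps, and a scalar tail of up to 3 points. The following Python sketch only illustrates that bookkeeping. The function name `partition_points` is hypothetical, the head-count rule stands in for `GET_VMX_UNALIGNED_COUNT` (whose definition is not shown in the listing) under the assumption that it counts 4-byte floats up to the next 16-byte boundary, and the `min` guard for very short vectors is an addition not present in the listing.

```python
def partition_points(n, b_addr):
    """Split n single-precision points the way the listing's control flow does.

    b_addr is the byte address of B's real part, assumed 4-byte aligned.
    Returns (head, trips16, has8, has4, tail).
    """
    # Assumed behaviour of GET_VMX_UNALIGNED_COUNT: floats before the
    # next 16-byte boundary (0 when already aligned, otherwise 1 to 3).
    head = ((16 - b_addr % 16) % 16) // 4
    head = min(head, n)            # guard added for this sketch only
    rest = n - head                # SUB( N, N, count )
    trips16 = rest >> 4            # SRWI_C( count, N, 4 ): 16 per trip
    has8 = bool(rest & 0x8)        # ANDI_C( count, N, 0x8 ): two_left
    has4 = bool(rest & 0x4)        # ANDI_C( count, N, 0x4 ): one_left
    tail = rest & 0x3              # ANDI_C( N, N, 3 ): post_loop
    return head, trips16, has8, has4, tail


if __name__ == "__main__":
    print(partition_points(100, 8))
```

Whatever the alignment, the pieces always add back up to `n`, which is a quick sanity check on the masks used in the listing.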
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MC Standard Algorithms PPC Macro language Version 



File Name:     ZDOTPR4_VMX.MAC

Description:   Vector Single Precision Complex Dot Product
               CPP dummy file for unaligned vector processing

Entry/params:  ZDOTPR4_VMX (A, I, B, J, C, N)
Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ]
                           - A->imagp[mI]*B->imagp[mJ])
               C[1] = sum (A->realp[mI]*B->imagp[mJ]
                           + A->imagp[mI]*B->realp[mJ])
               for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 1998 All rights reserved

Revision   Date     Engineer   Reason

0.0        000607   fpl        Created (from zdotpr_vmx.mac)
+ */

#if defined( BUILD_MAX )

#undef VMX_SAL
#undef VMX_NN
#undef VMX_NC
#undef VMX_CN
#undef VMX_CC

#if !defined( COMPILE_ESAL_JUMP_TABLE )

/* 1 variant: _zdotpr4_vmx( ) */
#define VMX_SAL
#include "zdotpr4_vmx.k"

#else /* 5 variants based on ESAL flag */

#define VMX_NN
#include "zdotpr4_vmx.k"
#undef VMX_NN

#define VMX_NC
#include "zdotpr4_vmx.k"
#undef VMX_NC

#define VMX_CN
#include "zdotpr4_vmx.k"
#undef VMX_CN

#define VMX_CC
#include "zdotpr4_vmx.k"
#undef VMX_CC

#endif /* end COMPILE_ESAL_JUMP_TABLE */

#endif /* end BUILD_MAX */
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zdotpr_vmx.k 
2/23/2001

/*
MC Standard Algorithms PPC Macro language Version

File Name:     ZDOTPR.K

Description:   CPP Source code for Vector Single Precision
               Split Complex Dot Product
Entry/params:  ZDOTPR (A, I, B, J, C, N)
               ZIDOTPR (A, I, B, J, C, N)

Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ]
                           -/+ A->imagp[mI]*B->imagp[mJ])
               C[1] = sum (A->realp[mI]*B->imagp[mJ]
                           +/- A->imagp[mI]*B->realp[mJ])
               for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 1998 All rights reserved

Revision   Date     Engineer   Reason

0.0        981215   fpl        Created
0.1        990310   fpl        Integrated with 750 library
0.2        000131   jfk        salppc.inc changes
0.3        000223   fpl        Fixed pre-loop bug
0.4        000717   fpl        Added dsts, removed LVXLs.
*/

#include "salppc.inc"
/**
  ESAL CPP definitions
**/
#undef FUNC_CONJ_ENTRY
#undef FUNC_ENTRY
#undef LOAD_A
#undef LOAD_B
#undef SUFFIX

#if defined( VMX_SAL )
#define FUNC_ENTRY _zdotpr_vmx
#define FUNC_CONJ_ENTRY _zidotpr_vmx
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label
#undef DSTA
#undef DSTB
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#elif defined( VMX_NN )
#define FUNC_ENTRY _zdotpr_vmx_nn
#define FUNC_CONJ_ENTRY _zidotpr_vmx_nn
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_nn
#undef DSTA
#undef DSTB
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#elif defined( VMX_NC )
#define FUNC_ENTRY _zdotpr_vmx_nc
#define FUNC_CONJ_ENTRY _zidotpr_vmx_nc
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_nc
#undef DSTA
#undef DSTB
#define DSTA( ptr, control ) DST( ptr, control, 0 ) \
                             ADDI( ptr, ptr, 64 )
#define DSTB( ptr, control )
#define DST_ENABLE

#elif defined( VMX_CN )
#define FUNC_ENTRY _zdotpr_vmx_cn
#define FUNC_CONJ_ENTRY _zidotpr_vmx_cn
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_cn
#undef DSTA
#undef DSTB
#define DSTA( ptr, control )
#define DSTB( ptr, control ) DST( ptr, control, 0 ) \
                             ADDI( ptr, ptr, 64 )
#define DST_ENABLE

#elif defined( VMX_CC )
//#define FUNC_ENTRY _zdotpr_vmx_cc
#define FUNC_ENTRY _zdotpr_vmx
#define FUNC_CONJ_ENTRY _zidotpr_vmx_cc
#define LOAD_A( vT, rA, rB ) LVX( vT, rA, rB )
#define LOAD_B( vT, rA, rB ) LVX( vT, rA, rB )
#define SUFFIX( label ) label##_cc
#undef DSTA
#undef DSTB
#define DSTA( ptr, control )
#define DSTB( ptr, control )
#undef DST_ENABLE

#else
#error YOU MUST DEFINE VMX_xxx, where x = C or N
#endif

#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */



/**
  Local CPP definitions
**/
#define NMASK2 0x8
#define NMASK1 0x4
#define NSHIFT 4
#define ADDRESS_INCREMENT 16

/**
  Input args
**/
#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
  Split complex parameters
**/
#define Ar0 A
#define Ai0 r10
#define Br0 B
#define Bi0 r11
#define Cr C
#define Ci r12

/**
  Local registers
**/
#define count r4
#define rtmp0 r4
#define rtmp1 r13

#define dst_stride r13
#define num_blocks r14
#define Ar1 r13
#define Ai1 r14
#define Ar2 r15
#define Ai2 r16
#define Ar3 r17
#define Ai3 r18

#define Br1 r19
#define Bi1 r20
#define Br2 r21
#define Bi2 r22
#define Br3 r23
#define Bi3 r24
#define ptr_offset0 r25
#define ptr_offset1 r26
#define addr_incr r27
#define dst_rptr r28
#define dst_iptr r29
#define dst_control r30

/**
  VMX registers
**/
#define rsumr v0
#define rsumi v1
#define isumr v2
#define isumi v3
#define rsum0 v4
#define rsum1 v5
#define isum0 v6
#define isum1 v7

#define ar0 v4
#define ai0 v5
#define ar1 v6
#define ai1 v7
#define ar2 v8
#define ai2 v9
#define ar3 v10
#define ai3 v11

#define br0 v12
#define bi0 v13
#define br1 v14
#define bi1 v15
#define br2 v16
#define bi2 v17
#define br3 v18
#define bi3 v19

/**
  FPU registers
**/
#define far f0
#define fbr f1
#define fai f2
#define fbi f3
#define frsumr f4
#define frsumi f5
#define fisumi f6
#define fisumr f7
#define frsum f8
#define fisum f9
#define rsum_vmx f10
#define isum_vmx f11


/**
  Begin code text, Save some registers
  Here for conjugate inner product
**/
U_ENTRY( FUNC_CONJ_ENTRY )
MR( rtmp0, Cr )
MR( Cr, Ci )
MR( Ci, rtmp0 )
MR( rtmp0, Br0 )
MR( Br0, Bi0 )
MR( Bi0, rtmp0 )

/**
  Here for normal inner product
**/
U_ENTRY( FUNC_ENTRY )
DECLARE_f0_f11
DECLARE_r3_r30
DECLARE_v0_v19

/**
  Initial setup code
**/
SAVE_r13_r30
USE_THRU_v19( VREGSAVE_COND )
LFS( frsumr, Ar0, 0 )
FSUBS( frsumr, frsumr, frsumr )
FMR( frsumi, frsumr )
FMR( fisumr, frsumr )
FMR( fisumi, frsumr )
FMR( rsum_vmx, frsumr )
FMR( isum_vmx, frsumr )

/**
  Process unaligned vector section first
**/
LABEL( SUFFIX( cont ) )
GET_VMX_UNALIGNED_COUNT( count, Ar0 )
LI( ptr_offset0, 0 )
BEQ( SUFFIX( aligned ) )
SUB( N, N, count ) /* adjust N for after loop */
/**
  Here to do first 1 to 3 points using standard FP
  Store result for later post_loop processing
**/
LFSX( far, Ar0, ptr_offset0 )
LFSX( fai, Ai0, ptr_offset0 )
DECR_C( count )
LFSX( fbr, Br0, ptr_offset0 )
LFSX( fbi, Bi0, ptr_offset0 )
FMULS( frsumr, far, fbr )
FMULS( frsumi, fai, fbi )
FMULS( fisumi, far, fbi )
FMULS( fisumr, fai, fbr )
ADDI( Ar0, Ar0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Bi0, Bi0, 4 )
BEQ( SUFFIX( aligned ) )
/**
  Loop does 1 or 2 more sum updates
**/
LABEL( SUFFIX( pre_loop ) )
LFSX( far, Ar0, ptr_offset0 )
LFSX( fai, Ai0, ptr_offset0 )
DECR_C( count )
LFSX( fbr, Br0, ptr_offset0 )
LFSX( fbi, Bi0, ptr_offset0 )
FMADDS( frsumr, far, fbr, frsumr )
ADDI( Ar0, Ar0, 4 )
FMADDS( frsumi, fai, fbi, frsumi )
ADDI( Ai0, Ai0, 4 )
FMADDS( fisumi, far, fbi, fisumi )
ADDI( Br0, Br0, 4 )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( pre_loop ) )

/**
  Here for VMX aligned loop code
  Prepare for loop entry: assign loop pointers, counters
**/
LABEL( SUFFIX( aligned ) )
/**
  DST setup: bring in 2 cachelines
  MAKE_STREAM_CODE( control_register, bytes_per_block, block_count,
                    byte_stride )
**/
#if defined( DST_ENABLE )
#if defined( EXPAND_NCC )
MR( dst_rptr, Ar )
MR( dst_iptr, Ai )
#elif defined( EXPAND_CNC )
MR( dst_rptr, Br )
MR( dst_iptr, Bi )
#endif

MAKE_STREAM_CODE( dst_control, 64, 1, 0 )
DSTA( dst_rptr, dst_control )
DSTA( dst_iptr, dst_control )
DSTB( dst_rptr, dst_control )
DSTB( dst_iptr, dst_control )
#endif

SRWI_C( count, N, NSHIFT ) /* 16 per trip */
LI( addr_incr, ADDRESS_INCREMENT ) /* constants defined above */
SLWI( ptr_offset1, addr_incr, 2 )
NEG( ptr_offset1, ptr_offset1 ) /* will be adding addr_incr << 3 */

ADD( Ar1, Ar0, addr_incr )
VXOR( rsumr, rsumr, rsumr )
ADD( Br1, Br0, addr_incr )
ADD( Ai1, Ai0, addr_incr )
VXOR( rsumi, rsumi, rsumi )
ADD( Bi1, Bi0, addr_incr )

ADD( Ar2, Ar1, addr_incr )
VXOR( isumr, isumr, isumr )
ADD( Br2, Br1, addr_incr )
ADD( Ai2, Ai1, addr_incr )
VXOR( isumi, isumi, isumi )
ADD( Bi2, Bi1, addr_incr )
ADD( Ar3, Ar2, addr_incr )
ADD( Br3, Br2, addr_incr )
ADD( Ai3, Ai2, addr_incr )
ADD( Bi3, Bi2, addr_incr )
SLWI( addr_incr, addr_incr, 3 ) /* bump by 8 elements */

/**
  Loop entry code
**/
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset0 )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
/**
  Top of double loop structure
**/
LABEL( SUFFIX( loop0 ) )
LOAD_A( ar1, Ar1, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
DSTA( dst_iptr, dst_control )
LOAD_B( br1, Br1, ptr_offset0 )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ai1, Ai1, ptr_offset0 )
LOAD_B( bi1, Bi1, ptr_offset0 )
DSTB( dst_iptr, dst_control )
DECR_C( count )

LOAD_A( ar2, Ar2, ptr_offset0 )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( br2, Br2, ptr_offset0 )
VMADDFP( rsumr, ar1, br1, rsumr )
ADD( ptr_offset1, ptr_offset1, addr_incr )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ai2, Ai2, ptr_offset0 )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( bi2, Bi2, ptr_offset0 )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( ar3, Ar3, ptr_offset0 )
VMADDFP( rsumi, ai2, bi2, rsumi )
LOAD_B( br3, Br3, ptr_offset0 )
LOAD_A( ai3, Ai3, ptr_offset0 )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, ptr_offset0 )
VMADDFP( isumr, ai2, br2, isumr )
BEQ( SUFFIX( loop0_exit ) )
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset1 )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset1 )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( ai0, Ai0, ptr_offset1 )
LOAD_B( bi0, Bi0, ptr_offset1 )
VMADDFP( isumr, ai3, br3, isumr )
BR( SUFFIX( loop1 ) )

/**
loop exit
**/

LABEL( SUFFIX(loop0_exit) )

MR( ptr_offset0, ptr_offset1 )

BR( SUFFIX(loop1_exit) )

/**
Top of second loop
**/

LABEL( SUFFIX(loop1) )

LOAD_A( ar1, Ar1, ptr_offset1 )
VMADDFP( rsumr, ar0, br0, rsumr )
DSTA( dst_iptr, dst_control )
LOAD_B( br1, Br1, ptr_offset1 )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ai1, Ai1, ptr_offset1 )
LOAD_B( bi1, Bi1, ptr_offset1 )
DSTB( dst_iptr, dst_control )
DECR_C( count )

LOAD_A( ar2, Ar2, ptr_offset1 )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( br2, Br2, ptr_offset1 )
VMADDFP( rsumr, ar1, br1, rsumr )
ADD( ptr_offset0, ptr_offset0, addr_incr )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ai2, Ai2, ptr_offset1 )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( bi2, Bi2, ptr_offset1 )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( ar3, Ar3, ptr_offset1 )
VMADDFP( rsumi, ai2, bi2, rsumi )
LOAD_B( br3, Br3, ptr_offset1 )
LOAD_A( ai3, Ai3, ptr_offset1 )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, ptr_offset1 )
VMADDFP( isumr, ai2, br2, isumr )
BEQ( SUFFIX(loop1_exit) )
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset0 )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset0 )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
VMADDFP( isumr, ai3, br3, isumr )
BR( SUFFIX(loop0) )

/**
Drop out of loop, flush pipe
**/

LABEL( SUFFIX(loop1_exit) )

VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
VMADDFP( isumi, ar3, bi3, isumi )
VMADDFP( isumr, ai3, br3, isumr )

/**
Remaining sum updates
**/

LABEL( SUFFIX(two_left) )

ANDI_C( count, N, 0x8 ) /* bit 3 */
BEQ( SUFFIX(one_left) )

LOAD_A( ar0, Ar0, ptr_offset0 )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
LOAD_A( ar1, Ar1, ptr_offset0 )
LOAD_B( br1, Br1, ptr_offset0 )
LOAD_A( ai1, Ai1, ptr_offset0 )
LOAD_B( bi1, Bi1, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
VMADDFP( rsumr, ar1, br1, rsumr )
VMADDFP( rsumi, ai1, bi1, rsumi )
VMADDFP( isumi, ar1, bi1, isumi )
VMADDFP( isumr, ai1, br1, isumr )
ADDI( ptr_offset0, ptr_offset0, 32 )

LABEL( SUFFIX(one_left) )

ANDI_C( count, N, 0x4 ) /* bit 2 */
BEQ( SUFFIX(combine) )

LOAD_A( ar0, Ar0, ptr_offset0 )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
ADDI( ptr_offset0, ptr_offset0, 16 )
/**
combine partial sums, permute, write out results
**/

LABEL( SUFFIX(combine) )

VSUBFP( rsumr, rsumr, rsumi ) /* rsumr = rsumr - rsumi */
VADDFP( isumi, isumi, isumr )

/**
8 bytes/cycle shuffle:
real/imag logic should be intermixed for efficiency
**/

VMRGHW( rsum0, rsumr, rsumr )
ANDI_C( addr_incr, N, 0x3 )
VMRGHW( isum0, isumi, isumi )
VMRGLW( rsumi, rsumr, rsumr )

SUB( addr_incr, N, addr_incr ) /* offset index for remainders */
VMRGLW( isumi, isumi, isumi )
VADDFP( rsum0, rsumi, rsum0 )

SLWI( addr_incr, addr_incr, 2 ) /* byte offset */
VADDFP( isum0, isumi, isum0 )

VMRGHW( rsumi, rsum0, rsum0 )
ADD( Ar0, Ar0, addr_incr )
VMRGHW( isumi, isum0, isum0 )
ADD( Ai0, Ai0, addr_incr )
VMRGLW( rsum0, rsum0, rsum0 )
ADD( Br0, Br0, addr_incr )
VMRGLW( isum0, isum0, isum0 )
ADD( Bi0, Bi0, addr_incr )
VADDFP( rsumr, rsumi, rsum0 )
LI( ptr_offset0, 0 ) /* needed for output */
VADDFP( isumi, isumi, isum0 )

/**
4 byte stores
**/

STVEWX( rsumr, Cr, ptr_offset0 )
STVEWX( isumi, Ci, ptr_offset0 )
/**
Remainders of 1-3 more to do
**/

ANDI_C( N, N, 3 )
LFS( rsum_vmx, Cr, 0 )
LFS( isum_vmx, Ci, 0 )
BEQ( SUFFIX(scaler_vmx_combine) )

/**
Here to do last 1-3 points using standard FP
**/

LABEL( SUFFIX(post_loop) )

LFS( far, Ar0, 0 )
LFS( fai, Ai0, 0 )
DECR_C( N )
LFS( fbr, Br0, 0 )
LFS( fbi, Bi0, 0 )

FMADDS( frsumr, far, fbr, frsumr )
FMADDS( frsumi, fai, fbi, frsumi )
FMADDS( fisumi, far, fbi, fisumi )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Ar0, Ar0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX(post_loop) )

/**
Write out result
**/

LABEL( SUFFIX(scaler_vmx_combine) )

FSUBS( frsum, frsumr, frsumi ) /* rsumr = rsumr - rsumi */
FADDS( fisum, fisumi, fisumr )
FADDS( frsum, frsum, rsum_vmx )
FADDS( fisum, fisum, isum_vmx )
STFS( frsum, Cr, 0 )
STFS( fisum, Ci, 0 )
/**
return
**/

LABEL( SUFFIX(ret) )

FREE_THRU_V19( VREGSAVE_COND )
REST_r13_r30
RETURN
FUNC_EPILOG
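The structure of the listing above (four running sums kept per vector lane, a final combine, and a scalar pass over the last 1-3 points) can be illustrated with a minimal Python sketch. The function name `zdotpr_lanes` and the `lanes` parameter are hypothetical and not part of the original listing, and the cross-lane reduction is done here with a plain `sum()` rather than the merge-and-add sequence used by the macro code.

```python
def zdotpr_lanes(ar, ai, br, bi, lanes=4):
    """Complex dot product with four lane-wise accumulators and a scalar tail.

    ar/ai and br/bi are the real and imaginary parts of the two input vectors.
    This mirrors the accumulator names of the listing; it is an illustration,
    not a translation of the macro code.
    """
    n = len(ar)
    n_vec = n - (n % lanes)      # number of points handled lane-wise
    rsumr = [0.0] * lanes        # per-lane sum of ar*br
    rsumi = [0.0] * lanes        # per-lane sum of ai*bi
    isumi = [0.0] * lanes        # per-lane sum of ar*bi
    isumr = [0.0] * lanes        # per-lane sum of ai*br

    for base in range(0, n_vec, lanes):
        for k in range(lanes):
            m = base + k
            rsumr[k] += ar[m] * br[m]
            rsumi[k] += ai[m] * bi[m]
            isumi[k] += ar[m] * bi[m]
            isumr[k] += ai[m] * br[m]

    # Combine partial sums (real = rsumr - rsumi, imag = isumi + isumr),
    # then reduce across lanes.
    real = sum(r - i for r, i in zip(rsumr, rsumi))
    imag = sum(a + b for a, b in zip(isumi, isumr))

    # Remaining 1-3 points in ordinary scalar arithmetic.
    for m in range(n_vec, n):
        real += ar[m] * br[m] - ai[m] * bi[m]
        imag += ar[m] * bi[m] + ai[m] * br[m]
    return real, imag
```

Keeping the four products in separate accumulators until the end means the subtraction and addition that form the real and imaginary parts happen once rather than on every iteration.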
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#define ZDOTPR 0 
#define ZIDOTPR 1 
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— MC Standard Algorithms PPC Macro language Version 



File Name: ZDOTPR.MAC

Description: Vector Single Precision Complex Dot Product
Entry/params: ZDOTPR (A, I, B, J, C, N)
Formula: C[0] = sum (A->realp[mI]*B->realp[mJ]
                   - A->imagp[mI]*B->imagp[mJ])
         C[1] = sum (A->realp[mI]*B->imagp[mJ]
                   + A->imagp[mI]*B->realp[mJ])
         for m=0 to N-1

Mercury Computer Systems, Inc.
Copyright (c) 1998 All rights reserved

Revision  Date    Engineer  Reason
0.0       981209  fpl       Created (from cdotpr.mac)
0.1       990310  fpl       750/G4 integration
0.1       990322  fpl       Stylistic changes

#define COMPILE_ESAL_JUMP_TABLE

#define FUNC_TYPE ZDOTPR

#if defined( BUILD_MAX )

#undef VMX_SAL
#undef VMX_NN
#undef VMX_NC
#undef VMX_CN
#undef VMX_CC

#if !defined( COMPILE_ESAL_JUMP_TABLE ) || defined( COMPILE_NO_ESAL_JUMP_TABLE )

/* 1 variant: zdotpr_vmx() */

#define VMX_SAL
#include "zdotpr_vmx.k"

#else /* 5 variants based on ESAL flag */

#define VMX_NN
#include "zdotpr_vmx.k"

#undef VMX_NN
#define VMX_NC
#include "zdotpr_vmx.k"

#undef VMX_NC
#define VMX_CN
#include "zdotpr_vmx.k"

#undef VMX_CN
#define VMX_CC
#include "zdotpr_vmx.k"
#undef VMX_CC
#endif /* end COMPILE_ESAL_JUMP_TABLE */

#endif /* end BUILD_MAX */
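The formula in the ZDOTPR header can be checked against a plain reference loop. The sketch below is an assumption-laden illustration: it takes I and J to be element strides into the split real/imaginary arrays, and the function name `zdotpr` and its argument order are chosen here for readability rather than taken from the library's calling convention.

```python
def zdotpr(a_re, a_im, i, b_re, b_im, j, n):
    """Return (C[0], C[1]) for n strided split-complex elements.

    C[0] = sum(A.re[m*i]*B.re[m*j] - A.im[m*i]*B.im[m*j])
    C[1] = sum(A.re[m*i]*B.im[m*j] + A.im[m*i]*B.re[m*j])   for m = 0 .. n-1
    """
    c0 = 0.0
    c1 = 0.0
    for m in range(n):
        ar, ai = a_re[m * i], a_im[m * i]
        br, bi = b_re[m * j], b_im[m * j]
        c0 += ar * br - ai * bi
        c1 += ar * bi + ai * br
    return c0, c1
```

For example, `zdotpr([1, 9, 2, 9, 3], [0, 9, 1, 9, 0], 2, [1, 1, 1], [1, 0, -1], 1, 3)` reads every second element of A and every element of B.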
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Wireless Communication Systems And Methods For Long-code Com- 
munications For Regenerative Multiple User Detection Involving 
Implicit Waveform Subtraction 

1. In a spread spectrum communication system of the type that processes one or more 
spread-spectrum waveforms ("user spread-spectrum waveforms"), each representative 
of a waveform associated with a respective user, the improvement comprising: 

a first logic element that generates a residual composite spread-spectrum waveform as 
a function of a composite spread-spectrum waveform and an estimated composite 
spread-spectrum waveform, 

one or more second logic elements each coupled to the first logic element, each second 
logic element generating a refined matched-filter detection statistic for at least a selected 
user as a function of 

(i) the residual composite spread-spectrum waveform and 

(ii) a characteristic of an estimate of the selected user's spread-spectrum 
waveform. 

2. In the system of claim 1, the further improvement wherein the characteristic is at least 
one of an estimated amplitude and an estimated symbol associated with the estimate of 
the selected user's spread-spectrum waveform. 

3. In the system of claim 1, the improvement wherein the spread-spectrum communica- 
tions system comprises a code division multiple access (CDMA) base station. 

4. In the system of claim 1, the improvement wherein the CDMA base station comprises 
one or more long-code receivers, and each long-code receiver generating one or more 
respective matched-filter detection statistics, from which the estimated composite 
spread-spectrum waveform is, in part, generated. 

5. In the system of claim 1, the improvement wherein the first logic element comprises 
summation logic which generates the residual composite spread-spectrum waveform 
based on the relation 

$$r_{res}^{(n)}[t] \equiv r[t] - \hat{r}^{(n)}[t]$$

wherein 

$r_{res}^{(n)}[t]$ is the residual composite spread-spectrum waveform, 

$r[t]$ represents the composite spread-spectrum waveform, 

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform, 

$t$ is a sample time period, and 

$n$ is an iteration count. 

6. In the system of claim 5, the further improvement wherein the estimated composite 
spread-spectrum waveform is pulse-shaped and is based on estimated complex ampli- 
tudes, estimated delay lags, estimated symbols, and codes of the one or more user 
spread-spectrum waveforms. 

7. In the system of claim 1, the further improvement wherein each second logic element 
comprises rake logic and summation logic which generates the refined matched-filter 
detection statistics based on the relation 

$$y_k^{(n+1)}[m] \equiv A_k^{(n)2} \cdot \hat{b}_k^{(n)}[m] + y_{res,k}^{(n)}[m]$$

wherein 

$A_k^{(n)2}$ represents an amplitude statistic, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$y_{res,k}^{(n)}[m]$ represents a residual matched-filter detection statistic for the $k$-th user, 
and 

$n$ is an iteration count. 

8. In the system of claim 1, the further improvement wherein the refined matched-filter 
detection statistic for each user is iteratively generated. 

9. In the system of claim 1, the further improvement wherein the refined matched-filter 
detection statistic for at least a selected user is generated by a long-code receiver. 

10. In the system of claim 1, the improvement wherein the first and second logic elements 
are implemented on any of processors, field programmable gate arrays, array proces- 
sors and co-processors, or any combination thereof. 

11. In a spread spectrum communication system of the type that processes one or more user 
spread-spectrum waveforms, each representative of a waveform associated with a 
respective user, the improvement comprising: 

a first logic element which generates an estimated composite spread-spectrum wave- 
form that is a function of estimated user complex channel amplitudes, time lags, and 
user codes, 

a second logic element coupled to the first logic element, the second logic element gen- 
erating a residual composite spread-spectrum waveform as a function of a composite user 
spread-spectrum waveform and the estimated composite spread-spectrum waveform, 

one or more third logic elements each coupled to the second logic element, the third 
logic element generating a refined matched-filter detection statistic for at least a selected 
user as a function of 

(i) the residual composite spread-spectrum waveform and 

(ii) a characteristic of an estimate of the selected user's spread-spectrum 
waveform. 



12. In the system of claim 11, the further improvement wherein the characteristic is at least 
one of an estimated amplitude, an estimated delay lag and an estimated symbol associ- 
ated with the estimate of the selected user's spread-spectrum waveform. 

13. In the system of claim 11, the improvement wherein the spread-spectrum communica- 
tions system is a code division multiple access (CDMA) base station. 

14. In the system of claim 13, the improvement wherein the CDMA base station comprises 
long-code receivers. 




15. In the system of claim 11, the improvement wherein the first logic element further com- 
prises arithmetic logic which generates the estimated composite spread-spectrum 
waveform based on the relation 

$$\hat{r}^{(n)}[t] = \sum_{r} g[r]\,\rho^{(n)}[t - r]$$

wherein 

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform, 
$g[t]$ represents a raised-cosine pulse shape. 

16. In the system of claim 15, the further improvement wherein the first logic element com- 
prises arithmetic logic which generates an estimated composite re-spread waveform 
based on the relation 

$$\rho^{(n)}[t] \equiv \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{r} \delta\!\left[t - \hat{\tau}_{kp}^{(n)} - r N_c\right] \cdot \hat{a}_{kp}^{(n)} \cdot c_k[r] \cdot \hat{b}_k^{(n)}\!\left[\lfloor r/N_k \rfloor\right]$$

wherein 

$K_v$ is a number of simultaneous dedicated physical channels for all users, 
$\delta[t]$ is a discrete-time delta function, 

$\hat{a}_{kp}^{(n)}$ is an estimated complex channel amplitude for the $p$-th multipath component 
for the $k$-th user, 

$c_k[r]$ represents a user code comprising at least a scrambling code, an orthogo- 
nal variable spreading factor code, and a j factor associated with even 
numbered dedicated physical channels, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$-th multipath component for the $k$-th user, 
$N_k$ is a spreading factor for the $k$-th user, 
$t$ is a sample time index, 
$L$ is a number of multi-path components, 
$N_c$ is a number of samples per chip, and 
$n$ is an iteration count. 



17. In the system of claim 11, the improvement wherein the second logic element com- 
prises summation logic which generates the residual composite spread-spectrum wave- 
form based on the relation 

$$r_{res}^{(n)}[t] \equiv r[t] - \hat{r}^{(n)}[t]$$

wherein 

$r_{res}^{(n)}[t]$ is the residual composite spread-spectrum waveform, 

$r[t]$ represents the composite spread-spectrum waveform, 

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform, 

$t$ is a sample time period, and 

$n$ is an iteration count. 

18. In the system of claim 17, the further improvement wherein the estimated composite 
spread-spectrum waveform is pulse-shaped and is based on the user spread-spectrum 
waveform. 

19. In the system of claim 18, the further improvement wherein each third logic element 
comprises rake logic and summation logic which generates the second user matched- 
filter detection statistic based on the relation 

$$y_k^{(n+1)}[m] \equiv A_k^{(n)2} \cdot \hat{b}_k^{(n)}[m] + y_{res,k}^{(n)}[m]$$

wherein 

$A_k^{(n)2}$ represents an amplitude statistic, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$y_{res,k}^{(n)}[m]$ represents the $k$-th user residual matched-filter detection statistic for the 
$m$-th symbol period, and 

$n$ is an iteration count. 

20. In the system of claim 11, the further improvement wherein the refined matched-filter 
detection statistic for each user is iteratively generated. 

21. In the system of claim 11, the improvement wherein the logic elements are implemented 
on any of processors, field programmable gate arrays, array processors and co-proces- 
sors, or any combination thereof. 

22. A method for multiple user detection in a spread-spectrum communication system that 
processes long-code spread-spectrum user transmitted waveforms comprising: 

generating a residual composite spread-spectrum waveform as a function of an arithme- 
tic difference between a composite spread-spectrum waveform and an estimated 
spread-spectrum waveform, 

generating a refined matched-filter detection statistic that is a function of a sum of a 
rake-processed residual composite spread-spectrum waveform for a selected user and 
an amplitude statistic for that selected user. 

23. The method of claim 22, comprising generating a refined matched-filter detection sta- 
tistic that is a function of a sum of a rake-processed residual composite spread-spectrum 
waveform for a selected user and an amplitude statistic for that selected user multiplied 
by a soft symbol estimate. 

24. The method of claim 22, further wherein the spread-spectrum communications system 
is a code division multiple access (CDMA) base station. 

25. The method of claim 22, wherein the step of generating the residual composite spread- 
spectrum waveform further comprises performing arithmetic logic that is based on the 
relation 

$$r_{res}^{(n)}[t] \equiv r[t] - \hat{r}^{(n)}[t]$$

wherein 

$r_{res}^{(n)}[t]$ is the residual composite spread-spectrum waveform, 

$r[t]$ represents the composite spread-spectrum waveform, 

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform, 

$t$ is a sample time period, and 

$n$ is an iteration count. 

26. The method of claim 22, wherein the estimated composite spread-spectrum waveform 
is pulse-shaped and is based on a composite user re-spread waveform. 

27. The method of claim 22, wherein the step of generating the refined matched-filter 
detection statistic representative of that user further comprises performing arithmetic 
logic based on the relation 

$$y_k^{(n+1)}[m] \equiv A_k^{(n)2} \cdot \hat{b}_k^{(n)}[m] + y_{res,k}^{(n)}[m]$$

wherein 

$A_k^{(n)2}$ represents an amplitude statistic, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$y_{res,k}^{(n)}[m]$ represents a residual matched-filter detection statistic, and 

$n$ is an iteration count. 

28. The method of claim 22, the further improvement wherein the refined matched-filter 
detection statistic is generated by a long-code receiver. 

29. The method of claim 22, the further improvement wherein the step of generating the 
residual matched-filter detection statistic for an $m$-th symbol period comprises perform- 
ing arithmetic logic based on the relation 

$$y_{res,k}^{(n)}[m] \equiv \mathrm{Re}\left\{ \sum_{p=1}^{L} \hat{a}_{kp}^{(n)*} \cdot \frac{1}{2N_k} \sum_{r=0}^{N_k-1} r_{res}^{(n)}\!\left[r N_c + \hat{\tau}_{kp}^{(n)} + m T_k\right] \cdot c_{km}^{*}[r] \right\}$$

wherein 

$y_{res,k}^{(n)}[m]$ represents the $k$-th user residual matched-filter detection statistic for the $m$-th 
symbol period, 

$L$ is a number of multi-path components, 

$\hat{a}_{kp}^{(n)}$ is the estimated complex channel amplitude for the $p$-th multipath compo- 
nent for the $k$-th user, 

$N_k$ is the spreading factor for the $k$-th user, 

$r_{res}^{(n)}[t]$ is the residual composite spread-spectrum waveform, 

$N_c$ is the number of samples per chip, and 

$\hat{\tau}_{kp}^{(n)}$ is the time lag for the $p$-th multipath component for the $k$-th user, 

$m$ is a symbol period, 

$T_k$ is a channel symbol duration for the $k$-th user, 

$c_{km}[r]$ represents a user code comprising at least a scrambling code, an orthogo- 
nal variable spreading factor code, and a j factor associated with even 
numbered dedicated physical channels. 

$n$ is an iteration count. 
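One refinement pass of the kind recited in claims 22 to 29 can be illustrated with a toy Python sketch. It rests on several simplifying assumptions that are not in the claims: a single multipath component per user, zero time lag, one sample per chip, a single symbol period, no pulse shaping, and an amplitude statistic taken to be the squared magnitude of the amplitude estimate. The function names `respread`, `rake` and `refine` are hypothetical.

```python
def respread(amps, codes, symbols):
    """Composite waveform: sum over users of amplitude * code chip * symbol."""
    n_chips = len(codes[0])
    return [sum(a * c[t] * b for a, c, b in zip(amps, codes, symbols))
            for t in range(n_chips)]


def rake(wave, code, amp):
    """Single-path matched filter: Re{ conj(amp) * 1/(2N) * sum wave[r]*conj(code[r]) }."""
    n_chips = len(code)
    corr = sum(w * c.conjugate() for w, c in zip(wave, code)) / (2 * n_chips)
    return (amp.conjugate() * corr).real


def refine(r, amps, codes, soft):
    """One iteration: residual waveform, residual statistic, refined statistic."""
    r_hat = respread(amps, codes, soft)           # estimated composite waveform
    r_res = [x - y for x, y in zip(r, r_hat)]     # residual = received - estimated
    return [abs(a) ** 2 * b + rake(r_res, c, a)   # amplitude statistic * symbol + residual
            for a, c, b in zip(amps, codes, soft)]


# Two users, four chips each, chips drawn from {+-1 +-j} so that |c|^2 = 2.
codes = [[1 + 1j, 1 + 1j, 1 + 1j, 1 + 1j],
         [1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j]]
amps = [1 + 0j, 2 + 0j]
received = respread(amps, codes, [1, -1])         # true symbols +1 and -1
refined = refine(received, amps, codes, [0.5, -0.5])
```

Because each user's own estimated contribution is added back after the residual is rake-processed, only the estimates of the other users are effectively removed, which is why the subtraction is described as implicit.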
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Wireless Communication Systems And Methods For Long-code Com- 
munications For Regenerative Multiple User Detection Involving 
Matched-filter Outputs 

30. In a spread spectrum communication system of the type that processes one or more 
spread-spectrum waveforms ("user spread-spectrum waveforms"), each representative 
of a waveform associated with a respective user, the improvement comprising: 

a first logic element which generates an estimated composite spread-spectrum wave- 
form that is a function of one or more of estimated complex amplitudes, estimated time 
lags, estimated symbols, and codes of the one or more user spread-spectrum wave- 
forms, 

one or more second logic elements each coupled to the first logic element, the one or 
more second logic elements generating a second matched-filter detection statistic for at 
least a selected user as a function of a difference between a first matched-filter detection 
statistic for that user and an estimated matched-filter detection statistic for that user as 
a function of the estimated composite spread-spectrum waveform. 

31. In the system of claim 30, the further improvement wherein the one or more second 
logic elements generate the second matched-filter detection statistic for at least the 
selected user as a function of a difference between (i) a sum of the first matched-filter 
detection statistic for that user and a characteristic of an estimate of the selected user's 
spread-spectrum waveform and (ii) the estimated matched-filter detection statistic for 
that user. 

32. In the system of claim 31, the further improvement wherein the characteristic is at least 
one of an estimated amplitude and an estimated symbol associated with the estimate of 
the selected user's spread-spectrum waveform. 

33. In the system of claim 30, the improvement wherein the spread-spectrum communica- 
tions system is a code division multiple access (CDMA) base station. 

34. In the system of claim 33, the improvement wherein the CDMA base station comprises 
long-code receivers. 

35. In the system of claim 30, the improvement further wherein the first logic element com- 
prises arithmetic logic which generates an estimated composite re-spread waveform 
based on the relation 

$$\rho^{(n)}[t] \equiv \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{r} \delta\!\left[t - \hat{\tau}_{kp}^{(n)} - r N_c\right] \cdot \hat{a}_{kp}^{(n)} \cdot c_k[r] \cdot \hat{b}_k^{(n)}\!\left[\lfloor r/N_k \rfloor\right]$$

wherein 

$K_v$ is a number of simultaneous dedicated physical channels for all users, 
$\delta[t]$ is a discrete-time delta function, 

$\hat{a}_{kp}^{(n)}$ is an estimated complex channel amplitude for the $p$-th multipath component 
for the $k$-th user, 

$c_k[r]$ represents a user code comprising at least a scrambling code, an orthogo- 
nal variable spreading factor code, and a j factor associated with even 
numbered dedicated physical channels, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$-th multipath component for the $k$-th user, 

$N_k$ is a spreading factor for the $k$-th user, 

$t$ is a sample time index, 

$L$ is a number of multi-path components, 

$N_c$ is a number of samples per chip, and 

$n$ is an iteration count. 

36. In the system of claim 35, the improvement wherein the first logic element further com- 
prises arithmetic logic which generates the estimated composite spread-spectrum 
waveform based on the relation 

$$\hat{r}^{(n)}[t] = \sum_{r} g[r]\,\rho^{(n)}[t - r]$$

wherein 

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform, and 
$g[t]$ represents a pulse shape. 

37. In the system of claim 30, the improvement further wherein the estimated composite 
residual spread-spectrum waveform is pulse-shaped and is based on the user spread- 
spectrum waveform. 

38. In the system of claim 30, the improvement further wherein each second logic element 
comprises rake logic and summation logic which generates the second matched-filter 
detection statistic based on the relation 

$$y_k^{(n+1)}[m] \equiv A_k^{(n)2} \cdot \hat{b}_k^{(n)}[m] + y_k^{(n)}[m] - y_{est,k}^{(n)}[m]$$

wherein 

$A_k^{(n)2}$ represents an amplitude statistic, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$y_k^{(n)}[m]$ represents the first matched-filter detection statistic for the selected 
user, 

$y_{est,k}^{(n)}[m]$ represents the estimated matched-filter detection statistic for the 
selected user, and 

$n$ is an iteration count. 

39. In the system of claim 38, the improvement further wherein the system generates the 
second matched-filter detection statistic for the selected user and zero, one or more 
further second matched-filter detection statistics for that user iteratively. 



40. In the system of claim 30, the improvement further wherein the first matched-filter 
detection statistic for at least the selected user is generated by a long-code receiver. 

41. In the system of claim 30, the improvement wherein the logic elements are imple- 
mented on any of processors, field programmable gate arrays, array processors and 
co-processors, or any combination thereof. 

42. In a method for multiple user detection in a spread-spectrum communication system 
that processes long-code spread-spectrum user waveforms, the improvement compris- 
ing a method of generating user matched-filter detection statistics for at least a selected 
user comprising: 

generating a composite spread-spectrum waveform as a function of a pulse-shaped 
composite re-spread waveform, 

generating a refined user matched-filter detection statistic for at least the selected user 
that is a function of a difference between a first matched-filter detection statistic for that 
user and an estimated matched-filter detection statistic for that user. 

43. In the method of claim 42, the further improvement comprising generating the refined- 
matched-filter detection statistic for at least the selected user as a function of a differ- 
ence between (i) the sum of the first matched-filter detection statistic for that user and a 
characteristic of an estimate of the selected user's spread-spectrum waveform and (ii) 
the estimated matched-filter detection statistic for that user. 

44. In the method of claim 43, the further improvement wherein the characteristic is at least 
one of an estimated amplitude, and an estimated symbol associated with an estimate of 
the selected user's spread-spectrum waveform. 

45. In the method of claim 42, further wherein the spread-spectrum communications 
system is a code division multiple access (CDMA) base station. 

46. In the method of claim 42, wherein the step of generating the composite spread-spec- 
trum waveform further comprises a function of a composite signal representing the sum 
of all estimated user waveforms. 

47. In the method of claim 46, the improvement further wherein the function of the com- 
posite signal representing the sum of all estimated user waveforms comprises a pulse- 
shaping filter. 

48. In the method of claim 42, wherein the step of generating the second matched-filter 
detection statistic representative of that user further comprises performing arithmetic 
logic based on the relation 

$$y_k^{(n+1)}[m] \equiv A_k^{(n)2} \cdot \hat{b}_k^{(n)}[m] + y_k^{(n)}[m] - y_{est,k}^{(n)}[m]$$

wherein 

$A_k^{(n)2}$ represents an amplitude statistic, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$y_k^{(n)}[m]$ represents the first matched-filter detection statistic, 
$y_{est,k}^{(n)}[m]$ represents the estimated matched-filter detection statistic, and 
$n$ is an iteration count. 

49. In the method of claim 48, the further improvement wherein the second matched-filter 
detection statistic is derived from the estimated composite spread-spectrum waveform 
based on the relation 

$$y_{est,k}^{(n)}[m] \equiv \mathrm{Re}\left\{ \sum_{p=1}^{L} \hat{a}_{kp}^{(n)*} \cdot \frac{1}{2N_k} \sum_{r=0}^{N_k-1} \hat{r}^{(n)}\!\left[r N_c + \hat{\tau}_{kp}^{(n)} + m T_k\right] \cdot c_{km}^{*}[r] \right\}$$

wherein 

$L$ is a number of multi-path components, 

$\hat{a}_{kp}^{(n)}$ is an estimated complex channel amplitude for the $p$-th multipath component 
for the $k$-th user, 

$N_k$ is a spreading factor for the $k$-th user, 

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform, 

$N_c$ is a number of samples per chip, and 

$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$-th multipath component for the $k$-th user, 

$m$ is a symbol period, 

$T_k$ is a data bit duration, 

$n$ is an iteration count and 

$c_{km}[r]$ represents a user code comprising at least a scrambling code, an orthogo- 
nal variable spreading factor code, and a j factor associated with even 
numbered dedicated physical channels. 
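The matched-filter-output form recited in claims 42 to 49 can be sketched in Python under the same toy assumptions as a single-path, zero-lag, one-sample-per-chip model with no pulse shaping. The sketch combines a first statistic computed from the received waveform with an estimated statistic computed from the estimated composite waveform. The names `matched_filter` and `refine_from_outputs` are hypothetical, and the amplitude statistic is assumed to be the squared magnitude of the amplitude estimate.

```python
def matched_filter(wave, code, amp):
    """Re{ conj(amp) * 1/(2N) * sum wave[r]*conj(code[r]) } for one user, one path."""
    n_chips = len(code)
    corr = sum(w * c.conjugate() for w, c in zip(wave, code)) / (2 * n_chips)
    return (amp.conjugate() * corr).real


def refine_from_outputs(y_first, y_est, amps, soft):
    """Refined statistic: amplitude statistic * soft symbol + first - estimated."""
    return [abs(a) ** 2 * b + y - ye
            for a, b, y, ye in zip(amps, soft, y_first, y_est)]


codes = [[1 + 1j, 1 + 1j, 1 + 1j, 1 + 1j],
         [1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j]]
amps = [1 + 0j, 2 + 0j]
soft = [0.5, -0.5]                                   # current soft symbol estimates
true = [1, -1]                                       # transmitted symbols

# Received and estimated composite waveforms, built chip by chip.
received = [sum(a * c[t] * b for a, c, b in zip(amps, codes, true)) for t in range(4)]
estimated = [sum(a * c[t] * b for a, c, b in zip(amps, codes, soft)) for t in range(4)]

y_first = [matched_filter(received, c, a) for c, a in zip(codes, amps)]
y_est = [matched_filter(estimated, c, a) for c, a in zip(codes, amps)]
refined = refine_from_outputs(y_first, y_est, amps, soft)
```

Since the matched filter is linear in its input waveform, the difference of the two filter outputs equals the filter applied to the residual waveform, so this form and the residual-waveform form give the same refined statistic in this model.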

Wireless Communication Systems And Methods For Long-code Com- 
munications For Regenerative Multiple User Detection Involving Pre- 
maximal Combination Matched Filter Outputs 

50. In a spread spectrum communication system of the type that processes one or more 
spread-spectrum waveforms ("user spread-spectrum waveforms"), each representative 
of a waveform received from a respective user, the improvement comprising: 

one or more first logic elements generating a first complex channel amplitude estimate 
corresponding to at least a selected user and at least a selected finger of a rake receiver 
that receives the selected user waveform. 

one or more second logic elements each coupled to one or more first logic elements, 
each generating an estimated composite spread-spectrum waveform that is a function of 
one or more of estimated complex channel amplitudes, estimated delay lags, estimated 
symbols, and/or codes of the one or more user spread-spectrum waveforms, 

one or more third logic elements each coupled to one or more second logic elements, 
the one or more third logic elements generating a second pre-combination matched- 
filter detection statistic for at least a selected user and for at least a selected finger as a 
function of a first pre-combination matched-filter detection statistic for that user and a 
pre-combination estimated matched-filter detection statistic for that user. 

51. In the system of claim 50, the further improvement comprising 

one or more fourth logic elements, each coupled to one or more third logic elements, 
the fourth logic element generating a second complex channel amplitude estimate cor- 
responding to at least a selected user and at least a selected finger. 

52. In the system of claim 50, the further improvement wherein the one or more third logic 
elements generate the second pre-combination matched-filter detection statistic for at 
least the selected user and at least the selected finger as a function of a difference 
between (i) the sum of the first pre-combination matched-filter detection statistic for 
that user and that finger and a characteristic of an estimate of the selected user's spread- 
spectrum waveform and (ii) the pre-combination estimated matched-filter detection 
statistic for that user and that finger. 

53. In the system of claim 52, the further improvement wherein the characteristic is at least 
one of an estimated amplitude and an estimated symbol associated with the estimate of 
the selected user's spread-spectrum waveform. 

54. In the system of claim 50, the improvement wherein the spread-spectrum communica- 
tions system is a code division multiple access (CDMA) base station. 

55. In the system of claim 54, the improvement wherein the CDMA base station comprises 
long-code receivers. 

56. In the system of claim 50, the improvement further wherein the first and fourth logic 
elements comprise arithmetic logic which generate a complex channel amplitude esti- 
mate corresponding to at least a selected user and at least a selected finger of a rake 
receiver that receives the selected user waveform based on the relation 

wherein 

$\hat{a}_{kp}^{(n)}$ is a complex channel amplitude estimate corresponding to the $p$-th finger of 
the $k$-th user, 

$w[s]$ is a filter, 

$N_p$ is a number of symbols, 

$y_{kp}^{(n)}[m]$ is a first pre-combination matched-filter detection statistic correspond- 
ing to the $p$-th finger of the $k$-th user for the $m$-th symbol period, 

$M$ is a number of symbols per slot, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$m$ is a symbol period index, 
$s$ is a slot index, and 
$n$ is an iteration count. 

57. In the system of claim 50, the improvement further wherein the second logic element 
comprises arithmetic logic which generates an estimated composite re-spread wave- 
form based on the relation 

$$\rho^{(n)}[t] \equiv \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{r} \delta\!\left[t - \hat{\tau}_{kp}^{(n)} - r N_c\right] \cdot \hat{a}_{kp}^{(n)} \cdot c_k[r] \cdot \hat{b}_k^{(n)}\!\left[\lfloor r/N_k \rfloor\right]$$

wherein 

$K_v$ is a number of simultaneous dedicated physical channels for all users, 
$\delta[t]$ is a discrete-time delta function, 

$\hat{a}_{kp}^{(n)}$ is an estimated complex channel amplitude for the $p$-th multipath component 
for the $k$-th user, 

$c_k[r]$ represents a user code comprising at least a scrambling code, an orthogo- 
nal variable spreading factor code, and a j factor associated with even 
numbered dedicated physical channels, 

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol 
period, 

$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$-th multipath component for the $k$-th user, 

$N_k$ is a spreading factor for the $k$-th user, 

$t$ is a sample time index, 

$L$ is a number of multi-path components, 

$N_c$ is a number of samples per chip, and 

$n$ is an iteration count. 

58. In the system of claim 57, the improvement wherein the second logic element further 
comprises arithmetic logic which generates the estimated composite spread-spectrum 
waveform based on the relation 

$$\hat{r}^{(n)}[t] = \sum_{r} g[r]\,\rho^{(n)}[t - r]$$

wherein 

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform, 
$g[t]$ represents a pulse shape. 

59. In the system of claim 50, the improvement further wherein the estimated composite 
residual spread-spectrum waveform is pulse-shaped and is based on the user spread- 
spectrum waveform. 

60. In the system of claim 50, the improvement further wherein each third logic element 
comprises rake logic and summation logic which generates the second pre-combination 
matched-filter detection statistic based on the relation

$$y_{kp}^{(n+1)}[m] \equiv \hat{a}_{kp}^{(n)} \cdot \hat{b}_k^{(n)}[m] + y_{kp}^{(n)}[m] - y_{\mathrm{est},kp}^{(n)}[m],$$

wherein

$y_{kp}^{(n+1)}[m]$ represents the pre-combination matched-filter detection statistic for
the $p$-th finger for the $k$-th user for the $m$-th symbol period,

$\hat{a}_{kp}^{(n)}$ is the complex channel amplitude for the $p$-th finger for the $k$-th user,

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol
period,

$y_{kp}^{(n)}[m]$ represents the first pre-combination matched-filter detection statistic
for the $p$-th finger for the $k$-th user for the $m$-th symbol period,

$y_{\mathrm{est},kp}^{(n)}[m]$ represents the pre-combination estimated matched-filter detection
statistic for the $p$-th finger for the $k$-th user for the $m$-th symbol period,
and

$n$ is an iteration count.
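
For illustration, and without limitation, the refinement recited in claim 60 may be understood as adding back the selected user's own estimated contribution and subtracting the estimated statistic. A minimal Python sketch follows; the function name `refine_statistic` and its argument names are hypothetical and chosen only for this example.

```python
def refine_statistic(a_hat, b_hat, y_first, y_est):
    """Return a refined pre-combination matched-filter statistic.

    a_hat   : estimated complex channel amplitude for the finger
    b_hat   : soft symbol estimate for the user and symbol period
    y_first : current (first) pre-combination statistic
    y_est   : pre-combination estimated statistic from the re-spread signal
    """
    return a_hat * b_hat + y_first - y_est


# Example with arbitrary illustrative values.
y_next = refine_statistic(0.5 + 0.5j, 1.0, 1 + 1j, 0.6 + 0.4j)
```

If the estimated statistic accounted for nothing but the selected user's own contribution, the last two terms of the update would cancel that contribution and the result would reduce to the first statistic; interference from other users captured in the estimated statistic is what the subtraction is meant to remove.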

61. In the system of claim 60, the improvement further wherein the system generates the 
second pre-combination matched-filter detection statistic for the selected user and 
finger and zero, one or more further matched-filter detection statistics for that user and 
finger iteratively. 

62. In the system of claim 55 A, the improvement further wherein the system generates the 
second complex channel amplitude estimates for the selected user and finger and zero, 
one or more further complex channel amplitude estimates for that user and finger itera- 
tively. 

63. In the system of claim 50, the improvement further wherein the first pre-combination 
matched-filter detection statistic for at least the selected user and finger is generated by 
a long-code receiver. 

64. In the system of claim 50, the improvement wherein the logic elements are implemented
on any of processors, field programmable gate arrays, array processors and
co-processors, or any combination thereof.

65. In a method for multiple user detection in a spread-spectrum communication system
that processes long-code spread-spectrum user waveforms, the improvement comprising
a method of generating user pre-combination matched-filter detection statistics for
at least a selected user and finger comprising:

generating a composite spread-spectrum waveform as a function of a pulse-shaped
composite re-spread waveform, and

generating a second user pre-combination matched-filter detection statistic for at least 
the selected user and finger that is a function of a difference between a first pre-combi- 
nation matched-filter detection statistic for that user and finger and a pre-combination 
estimated matched-filter detection statistic for that user and finger. 

66. In the method of claim 65, the further improvement comprising generating the second
pre-combination matched-filter detection statistic for at least the selected user and 
finger as a function of a difference between (i) the sum of the first pre-combination 
matched-filter detection statistic for that user and finger and a characteristic of an esti- 
mate of the selected user's spread-spectrum waveform and (ii) the pre-combination 
estimated matched-filter detection statistic for that user. 

67. In the method of claim 66, the further improvement wherein the characteristic is at least 
one of an estimated amplitude, and an estimated symbol associated with an estimate of 
the selected user's spread-spectrum waveform. 

68. In the method of claim 65, further wherein the spread-spectrum communications 
system is a code division multiple access (CDMA) base station. 

69. In the method of claim 65, wherein the step of generating the composite spread-spec- 
trum waveform further comprises a function of a composite signal representing the sum 
of all estimated user waveforms. 

70. In the method of claim 69, the improvement further wherein the function of the com- 
posite signal representing the sum of all estimated user waveforms comprises a pulse- 
shaping filter. 

71. In the method of claim 16, wherein the step of generating the second pre-combination
matched-filter detection statistic representative of that user and finger further comprises
performing arithmetic logic based on the relation

$$y_{kp}^{(n+1)}[m] \equiv \hat{a}_{kp}^{(n)} \cdot \hat{b}_k^{(n)}[m] + y_{kp}^{(n)}[m] - y_{\mathrm{est},kp}^{(n)}[m],$$

wherein

$y_{kp}^{(n+1)}[m]$ represents the pre-combination matched-filter detection statistic for
the $p$-th finger for the $k$-th user for the $m$-th symbol period,

$\hat{a}_{kp}^{(n)}$ is the complex channel amplitude for the $p$-th finger for the $k$-th user,

$\hat{b}_k^{(n)}[m]$ represents a soft symbol estimate for the $k$-th user for the $m$-th symbol
period,

$y_{kp}^{(n)}[m]$ represents the first pre-combination matched-filter detection statistic
for the $p$-th finger for the $k$-th user for the $m$-th symbol period,

$y_{\mathrm{est},kp}^{(n)}[m]$ represents the pre-combination estimated matched-filter detection
statistic for the $p$-th finger for the $k$-th user for the $m$-th symbol period,
and

$n$ is an iteration count.

72. In the method of claim 71, the further improvement wherein the second pre-combination
matched-filter detection statistic is derived from the estimated composite spread-spectrum
waveform based on the relation

$$y_{\mathrm{est},kp}^{(n)}[m] \equiv \frac{1}{2N_k} \sum_{r=0}^{N_k-1} \hat{r}^{(n)}\left[r N_c + \hat{\tau}_{kp}^{(n)} + m T_k\right] \cdot c_{km}^{*}[r],$$

wherein

$N_k$ is a spreading factor for the $k$-th user,

$\hat{r}^{(n)}[t]$ represents the estimated composite spread-spectrum waveform,

$N_c$ is a number of samples per chip,

$\hat{\tau}_{kp}^{(n)}$ is an estimated time lag for the $p$-th multipath component for the $k$-th user,

$m$ is a symbol period,

$c_{km}[r]$ represents a user code comprising at least a scrambling code, an orthogonal
variable spreading factor code, and a $j$ factor associated with even-numbered
dedicated physical channels,

$T_k$ is a data bit duration, and

$n$ is an iteration count.
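
By way of example only, the despreading step recited in claim 72 might be sketched in Python as a correlation of the estimated composite waveform with the conjugated user code at the estimated path delay. The function name `despread`, the expression of the time lag and symbol duration in samples, and the normalization by twice the spreading factor (appropriate when each complex chip has squared magnitude 2) are assumptions of this sketch.

```python
def despread(r_hat, code, lag, m, n_c, t_k):
    """Correlate r_hat against one symbol's worth of user code.

    r_hat : estimated composite spread-spectrum waveform (sequence of complex)
    code  : chips c_km[r] for symbol period m (length = spreading factor)
    lag   : estimated path delay in samples (assumed integer)
    m     : symbol period index
    n_c   : samples per chip
    t_k   : symbol duration in samples
    """
    n_k = len(code)
    acc = 0j
    for r, chip in enumerate(code):
        acc += r_hat[r * n_c + lag + m * t_k] * chip.conjugate()
    return acc / (2 * n_k)   # assumed normalization for chips with |c|^2 = 2


# Example: a single path carrying amplitude-times-symbol 0.5j at lag 1.
example_code = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
example_wave = [0j] * 16
for r, chip in enumerate(example_code):
    example_wave[r * 2 + 1] = 0.5j * chip
y_est_example = despread(example_wave, example_code, lag=1, m=0, n_c=2, t_k=8)
```

With a single noiseless path, the correlation recovers the product of channel amplitude and symbol that was placed on the chips, which is the quantity the estimated statistic is meant to approximate for each finger.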